Access to timely and accurate healthcare information remains a deeply troubling inequity in India's linguistically diverse, resource-constrained rural and semi-urban communities, and this research seeks to address it in measured but meaningful ways. This paper presents a WhatsApp-based Multimodal AI Triage System (SIH Problem Statement: SIH-25049) that integrates Convolutional Neural Networks (CNNs), Natural Language Processing (NLP), and Explainable Artificial Intelligence (XAI) to deliver real-time diagnostic support in English, Hindi, and Telugu, the three languages we identified as reaching the broadest underserved population in our target deployment context. The system accepts symptom photographs and text-based queries through the Twilio WhatsApp API and routes them through two specialized EfficientNet-B0 deep learning pipelines: OcularNet, developed for ocular disease detection across three classes (Cataract, Diabetic Retinopathy, Normal), and SkinNet, designed for dermatological classification across five classes (Acne, Melanoma, Scaly Lesions, Vitiligo, Warts). It then generates verified medical advisories from a curated multilingual knowledge base. We incorporate Grad-CAM heatmaps at 40% overlay transparency not merely as a technical feature, but as a deliberate commitment to clinical transparency and patient trust. OcularNet achieved 70.2% validation accuracy (+34.9% over random baseline) and SkinNet achieved 70.23% accuracy (3× above random baseline), with end-to-end response latency of 5.2–6.0 seconds on an NVIDIA RTX 4050 GPU. We believe the system's most consequential design decision is also its simplest: by operating entirely within WhatsApp, it demands nothing extra from the very people who already have the least. It requires no new applications, no new devices, and no new barriers.
Introduction
The paper addresses India’s major healthcare accessibility gap, especially in rural regions where doctor availability is low and language diversity limits the usability of most digital health tools. It proposes a WhatsApp-based multimodal AI system that provides accessible, real-time medical assistance without requiring app installation or English proficiency.
The system integrates image-based disease detection and multilingual text advice (English, Hindi, Telugu) using two deep learning models—OcularNet for eye diseases and SkinNet for skin conditions—built on EfficientNet-B0. Users send images via WhatsApp, and the backend (Flask + Twilio + Ngrok) processes them asynchronously to return diagnosis, confidence scores, and treatment guidance.
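The routing and asynchronous hand-off described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed backend: the classifiers are stubs that return fixed probabilities, the `kind` routing key ("eye" or "skin") is a hypothetical way of choosing a model, the advisory strings are placeholders rather than verified knowledge-base entries, and the names `triage`, `submit_triage`, and `TriageResult` are introduced only for this example. The Flask and Twilio layers are omitted so that the sketch runs with the standard library alone.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Class lists as given in the paper.
OCULAR_CLASSES = ["Cataract", "Diabetic Retinopathy", "Normal"]
SKIN_CLASSES = ["Acne", "Melanoma", "Scaly Lesions", "Vitiligo", "Warts"]


@dataclass
class TriageResult:
    model: str
    label: str
    confidence: float
    advisory: str


# Hypothetical stand-ins for the trained EfficientNet-B0 models: each returns
# fixed class probabilities instead of running real inference.
def stub_ocular_net(image: bytes) -> List[float]:
    return [0.72, 0.18, 0.10]


def stub_skin_net(image: bytes) -> List[float]:
    return [0.05, 0.10, 0.70, 0.10, 0.05]


Classifier = Callable[[bytes], List[float]]
MODELS: Dict[str, Tuple[str, Classifier, List[str]]] = {
    "eye": ("OcularNet", stub_ocular_net, OCULAR_CLASSES),
    "skin": ("SkinNet", stub_skin_net, SKIN_CLASSES),
}

# Placeholder knowledge-base entries keyed by (label, language code).
# Real entries would be clinically verified text in each language.
ADVISORIES: Dict[Tuple[str, str], str] = {
    ("Cataract", "en"): "<verified English advisory for Cataract>",
    ("Cataract", "hi"): "<verified Hindi advisory for Cataract>",
    ("Scaly Lesions", "en"): "<verified English advisory for Scaly Lesions>",
}


def triage(kind: str, image: bytes, lang: str) -> TriageResult:
    """Run the selected model and attach an advisory in the user's language."""
    name, classifier, classes = MODELS[kind]
    probs = classifier(image)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    label = classes[best]
    # Fall back to English when no entry exists for the requested language.
    advisory = ADVISORIES.get((label, lang)) or ADVISORIES.get((label, "en"), "")
    return TriageResult(name, label, probs[best], advisory)


# A small worker pool lets a webhook handler return immediately while
# inference continues in the background.
_executor = ThreadPoolExecutor(max_workers=2)


def submit_triage(kind: str, image: bytes, lang: str) -> Future:
    return _executor.submit(triage, kind, image, lang)
```

In a webhook handler, `submit_triage("eye", image_bytes, "hi")` would return a future at once, and the reply message would be sent when `future.result()` becomes available. Keeping inference off the request thread is one plausible way to realize the asynchronous processing mentioned above.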
To improve transparency, the system uses Grad-CAM explainability, which highlights image regions influencing the prediction, helping users trust the results. A multilingual advisory engine then provides context-aware medical recommendations from a verified knowledge base.
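The Grad-CAM step can be illustrated with a short NumPy sketch that follows the standard formulation: channel weights are the spatially averaged gradients of the class score, and the map is the ReLU of the weighted sum of feature maps. Several points are assumptions of this example rather than details from the system: the activations and gradients are taken as already extracted from the last convolutional layer, the map is assumed to be already resized to the image resolution, the heatmap is drawn in the red channel only, and the 40% overlay transparency stated in the abstract is interpreted as a heatmap blend weight of 0.4. The function names `grad_cam` and `overlay_heatmap` are hypothetical.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM map normalized to [0, 1].

    activations: feature maps of shape (K, H, W).
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    """
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def overlay_heatmap(image: np.ndarray, cam: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Blend a heatmap onto an RGB uint8 image of shape (H, W, 3).

    Assumes `cam` has the same height and width as `image`.
    """
    heat = np.zeros(image.shape, dtype=np.float32)
    heat[..., 0] = cam * 255.0                        # simple red-channel colouring
    blended = (1.0 - alpha) * image.astype(np.float32) + alpha * heat
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

A production pipeline would typically upsample the map to the input resolution and apply a full colour map before blending. The sketch keeps only the weighting, ReLU, and alpha-blend steps that determine which regions are highlighted.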
The paper also reviews related work in multimodal AI, medical chatbots, and explainable AI, highlighting that most existing systems are either unimodal, non-transparent, or not accessible to rural users. The proposed system combines multimodal diagnosis, explainability, multilingual support, and WhatsApp-based delivery into a single platform aimed at improving healthcare accessibility.
Experiments show moderate classification performance (around 70% accuracy), with some confusion between visually similar disease classes, but overall demonstrate feasibility for real-world deployment.
Conclusion
This paper presents a triage system that is accessed entirely through WhatsApp, which removes the hardware barrier and, because users already know the platform, the adoption barrier as well. Its main features are image-based disease detection, Grad-CAM explainability, and multilingual chat. The system is easy to deploy and cost-effective for users, and it helps promote preventive healthcare in rural and semi-urban areas. It comprises two models with an overall accuracy of about 70%, trained on limited, openly available medical datasets covering various skin and eye diseases. The system also offers text chat support, with an offline RAG mode for serious queries and an online mode for general and current-affairs queries. Future work includes a text reader that reads responses aloud in multiple languages; an expanded disease library, obtained by training the models on more cases; a tiered architecture that connects multiple models to detect complex diseases with numerous symptoms; and an expanded knowledge base for greater offline RAG capability.
References
[1] D. Kavya, D. Kiran Kumar, M. Divya Anjali, and A. P. Ganesh, “Chatbot for multilingual healthcare environment using Bio-BERT,” Int. J. Res. Appl. Sci. Eng. Technol., 2025.
[2] S. Badlani, T. Aditya, M. Dave, and S. Chaudhari, “Multilingual healthcare chatbot using machine learning,” in Proc. 2021 2nd Int. Conf. Emerging Technol. (INCET), Belgaum, India, May 2021.
[3] B. D. Simon, K. B. Ozyoruk, D. G. Gelikman, S. A. Harmon, and B. Türkbey, “The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: a narrative review,” Diagn. Interv. Radiol., vol. 31, no. 4, pp. 303–312, 2025.
[4] R. R. Selvaraju et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE ICCV, Venice, Italy, 2017, pp. 618–626.
[5] S. Muneer, T. M. Ghazal, T. Alyas, M. A. Raza, S. Abbas, O. AlZoubi, and O. Ali, “Explainable AI-driven chatbot system for heart disease prediction using machine learning,” Int. J. Adv. Comput. Sci. Appl. (IJACSA), vol. 15, no. 12, 2024.
[6] World Health Organization, “WHO Global Strategy on Digital Health 2020–2025,” Geneva, Switzerland, Tech. Rep., 2021.
[7] Y. Zhu, X. Yin, A. Wee-Chung Liew, and H. Tian, “Privacy-preserving in medical image analysis: a review of methods and applications,” in Lecture Notes in Computer Science, vol. 15502, Springer, Singapore, 2025.
[8] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, Long Beach, CA, 2019, pp. 6105–6114.
[9] V. Gulshan et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy,” JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
[10] A. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115–118, 2017.
[11] Twilio Inc., “Twilio WhatsApp API Documentation,” 2024.
[12] B. P. Cabral, L. A. M. Braga, C. G. Conte Filho, B. Penteado, S. L. F. de Castro Silva, L. Castro, M. Fornazin, and F. Mota, “Future use of AI in diagnostic medicine: 2-wave cross-sectional survey study,” J. Med. Internet Res., vol. 27, p. e53892, Feb. 2025.
[13] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Feb. 2020.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, Jun. 2019, pp. 4171–4186.
[15] E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nat. Med., vol. 25, no. 1, pp. 44–56, Jan. 2019.
[16] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proc. 31st Int. Conf. Neural Information Processing Systems (NIPS), Long Beach, CA, 2017, pp. 4765–4774.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. 31st Int. Conf. Neural Information Processing Systems (NIPS), Long Beach, CA, 2017, pp. 5998–6008.
[18] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Francisco, CA, Aug. 2016, pp. 1135–1144.
[19] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis,” IEEE J. Biomed. Health Inform., vol. 22, no. 5, pp. 1589–1604, Sep. 2018.
[20] Z. Obermeyer and E. J. Emanuel, “Predicting the future — big data, machine learning, and clinical medicine,” N. Engl. J. Med., vol. 375, no. 13, pp. 1216–1219, Sep. 2016.