Abstract
Accurate dietary intake assessment remains a challenging task, especially in regions with diverse and visually complex cuisines such as South India, where meals often consist of heterogeneous mixtures and non-standard plating. Traditional nutrition-tracking applications rely heavily on manual logging, leading to inaccurate reporting and reduced long-term user engagement. This paper presents DataBowl, a progressive web-based multimodal framework that automatically estimates nutritional content from a single food image. The system integrates YOLOv8 for fine-grained ingredient detection, a Vision-Language Model (VLM) for contextual dish interpretation, and a Large Language Model (LLM) for nutrient aggregation and reasoning. DataBowl effectively identifies both major dish components and subtle ingredients such as groundnuts, chillies, and coriander leaves, enabling more precise nutrient profiling. A custom-annotated dataset of South Indian dishes was developed for evaluation, and the proposed pipeline achieved an overall accuracy of 85% on this challenging domain. In addition to nutrient estimation, DataBowl provides personalized recipe recommendations through a curated recipe module and maintains a longitudinal history of user uploads to highlight nutrient deficiencies and evolving dietary patterns. Experimental results demonstrate that the multimodal design enhances interpretability, ingredient-level granularity, and real-world usability compared to conventional single-model approaches, positioning DataBowl as a practical tool for personalized diet monitoring and lifestyle management.
Introduction
This paper presents DataBowl, a multimodal Progressive Web Application (PWA) designed to improve dietary nutrient tracking, particularly for complex South Indian cuisines that are difficult to analyze using traditional diet-logging methods. Manual tracking is often time-consuming, inaccurate, and ineffective for mixed dishes containing small but nutritionally significant ingredients such as spices, herbs, and garnishes. Existing automated food-recognition systems also perform poorly in this setting because they are trained mainly on Western datasets and lack fine-grained ingredient detection.
To address these limitations, DataBowl integrates YOLOv8 for detailed ingredient detection, a Vision-Language Model (VLM) for contextual dish understanding, and a Large Language Model (LLM) for nutrient estimation and aggregation. This multimodal framework enables automatic identification of both major and minor ingredients from a single food image, interprets dish context, and estimates macronutrients and micronutrients without requiring user input. A custom-annotated dataset of South Indian dishes supports improved regional generalization.
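The end-to-end flow can be summarized as three stages: detection, interpretation, and aggregation. The sketch below is a minimal illustration of that pipeline, assuming the Ultralytics YOLOv8 Python API for detection; the `interpret_dish` and `estimate_nutrients` helpers are hypothetical stand-ins for the VLM and LLM stages, whose concrete APIs are not specified here.

```python
# Minimal sketch of the detection -> VLM -> LLM pipeline (illustrative only).
# Assumes the Ultralytics YOLOv8 API; the VLM and LLM stages are represented
# by hypothetical stub functions rather than any specific model API.
from ultralytics import YOLO

detector = YOLO("databowl_yolov8.pt")  # hypothetical fine-tuned weights


def interpret_dish(image_path: str, ingredients: list[str]) -> str:
    # Hypothetical VLM stage: in the real system this would prompt a
    # Vision-Language Model with the image and the detected ingredient list.
    return "dish containing " + ", ".join(ingredients)


def estimate_nutrients(dish_context: str, ingredients: list[str]) -> dict:
    # Hypothetical LLM stage: in the real system this would ask an LLM to
    # aggregate per-ingredient nutrient values into a meal-level profile.
    return {"calories_kcal": None, "protein_g": None, "carbs_g": None, "fat_g": None}


def analyze_meal(image_path: str) -> dict:
    # 1. Fine-grained ingredient detection (major components and garnishes).
    result = detector(image_path)[0]
    ingredients = [detector.names[int(box.cls)] for box in result.boxes]

    # 2. Contextual dish interpretation (VLM).
    dish_context = interpret_dish(image_path, ingredients)

    # 3. Nutrient aggregation and reasoning (LLM).
    nutrients = estimate_nutrients(dish_context, ingredients)

    return {"ingredients": ingredients, "dish": dish_context, "nutrients": nutrients}
```

Keeping the stages separate in this way reflects the modularity of the framework: the detector, VLM, and LLM components can each be retrained or swapped without affecting the others.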
The system architecture consists of layered components for input handling, processing, user interaction, data storage, and output visualization. It provides real-time nutrient summaries, ingredient overlays, personalized recipe recommendations, and longitudinal dietary analytics through a cross-platform PWA.
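As one way of realizing the input-handling and output layers, the PWA client can post an image to a server endpoint and receive the nutrient summary it renders as overlays and history entries. The sketch below is illustrative only: the paper does not name a web framework, so FastAPI is assumed, and the handler reuses the hypothetical `analyze_meal` helper from the previous sketch.

```python
# Illustrative server-side entry point for the PWA (web framework not specified
# in the paper; FastAPI is assumed, and analyze_meal is the earlier sketch).
import tempfile

from fastapi import FastAPI, File, UploadFile

app = FastAPI()


@app.post("/analyze")
async def analyze(image: UploadFile = File(...)) -> dict:
    # Persist the upload so the detector can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
        tmp.write(await image.read())
        path = tmp.name
    # Run detection -> VLM -> LLM and return a JSON summary that the client
    # overlays on the image and appends to the user's longitudinal history.
    return analyze_meal(path)
```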
Experimental results show strong performance, with 85% overall accuracy, high precision and recall in ingredient recognition, and low inference latency (~220 ms), making the system suitable for real-time use. Most errors occur in visually similar liquid dishes and overlapping ingredients. Overall, the study demonstrates that a unified multimodal approach significantly outperforms single-model systems in nutrient estimation for culturally diverse meals, offering a practical and scalable solution for accurate dietary monitoring.
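For reference, the per-ingredient detection metrics quoted above follow the standard precision, recall, and F1 definitions over true positive, false positive, and false negative counts. The snippet below is a generic illustration of those formulas, not the paper's evaluation code, and the counts in the example call are placeholders.

```python
# Standard precision/recall/F1 from per-ingredient detection counts.
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder counts for illustration only (not results reported in the paper).
print(detection_metrics(tp=170, fp=15, fn=15))
```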
Conclusion
The proposed multimodal system successfully meets its core objective of providing accurate, explainable nutrient estimation from a single food image by tightly integrating YOLOv8-based ingredient detection, Vision-Language Models for contextual understanding, and Large Language Models for nutrition reasoning within a Progressive Web App framework. With an overall accuracy of 85%, strong ingredient-wise precision, recall, and F1-score, and competitive mAP values for the YOLOv8 module, the system demonstrates that complex Indian meals with multiple ingredients can be analyzed in a practical, near-real-time setting while still offering user-friendly interaction and personalized dietary feedback. Beyond raw metrics, the project establishes a scalable, modular architecture that can be extended to richer South Indian meal scenarios, larger and more diverse datasets, and deeper personalization such as condition-specific diet guidance and longitudinal tracking. By bridging computer vision, multimodal learning, and nutrition informatics in a deployable PWA, the work provides a solid foundation for future research and real-world deployment in clinical, wellness, and everyday lifestyle applications, particularly for culturally diverse and visually complex cuisines.