MedGem AI: An Open-Source Multi-Modal Platform for Medical Image Analysis

Authors: Dr. U. M. Patil, Nomesh R. Kirange, Mansi S. Bendale, Tejas R. Jadhav, Krushna D. Patil

DOI Link: https://doi.org/10.22214/ijraset.2026.82787

Abstract

Healthcare systems worldwide face a severe bottleneck: more than 60% of the global population lacks access to prompt interpretation of medical images by specialists. To address this problem, MedGem AI presents a locally installed, HIPAA-compliant platform that gives physicians decision-support intelligence. Using Google's open-weight medical AI models, MedGem AI aims to deliver expert-level insights in the most resource-constrained clinical settings. The core of the technology is a hierarchical architecture built on the Health AI Developer Foundations (HAI-DEF). Within this ecosystem, the MedGemma 4B model takes center stage and is responsible for interpreting imagery, answering visual questions, and automatically creating structured clinical notes. Other models handle specialized tasks: CXR Foundation (anomaly identification in chest X-ray images), Derm Foundation (diagnostic interpretation of skin lesions), and Path Foundation (tissue classification in digital histopathology). To run on regular consumer-grade hardware with at least 8 GB of VRAM, each network was quantized with NF4 4-bit quantization. A Zero-Footprint Privacy Architecture forms the basis of the solution's data sovereignty and security. To ensure that protected health information (PHI) never reaches the cloud, MedGem AI runs all computational workflows on-site. In addition, the Privacy Guard automatically removes metadata and PHI from DICOM files to keep the process HIPAA-compliant. To gain physicians' trust and increase robustness, MedGem AI implements Explainable AI capabilities by showing attention maps (where the model focuses) and uses an Ensemble Fusion Engine that combines model predictions by weighted voting.

Introduction

This paper presents MedGem AI, a privacy-focused multimodal medical diagnostic platform built on Google's MedGemma 4B model. The system addresses the global shortage of radiologists and pathologists, which leaves nearly 60% of the world's population without timely access to expert medical image interpretation. Traditional AI systems based on CNNs such as ResNet and DenseNet can detect abnormalities but are limited to classification tasks and cannot explain their reasoning or incorporate patient context.

To overcome these limitations, MedGem AI employs Multimodal Large Language Models (MLLMs) that combine medical image analysis with natural language understanding and reasoning. Unlike conventional models, MedGemma can perform visual question answering, generate detailed clinical reports, and support interactive conversations with healthcare professionals. Its medical-domain training on millions of medical images, clinical notes, and PubMed articles enables superior performance, especially in complex and unusual cases.

The platform is designed to democratize healthcare by running locally on affordable hardware using 4-bit NF4 quantization, allowing deployment in rural and low-resource healthcare facilities while maintaining patient privacy. The architecture integrates specialized domain models for radiology (CXR Foundation), dermatology (Derm Foundation), and pathology (Path Foundation). An Ensemble Fusion Engine combines predictions from these models with MedGemma's reasoning capabilities to improve diagnostic reliability and reduce AI hallucinations.
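
The weighted voting attributed to the Ensemble Fusion Engine can be illustrated with a minimal sketch. The function name `fuse_predictions`, the model weights, and the example scores are hypothetical and chosen only for illustration; the paper does not state how the weights are set or how scores are normalized.

```python
from collections import defaultdict


def fuse_predictions(predictions, weights):
    """Combine per-model label scores by weighted voting.

    predictions: {model_name: {label: score in [0, 1]}}
    weights:     {model_name: non-negative weight}
    Returns (best_label, fused_scores). Fused scores are divided by the
    total weight of the models that contributed a prediction.
    """
    total_weight = sum(weights[name] for name in predictions)
    if total_weight <= 0:
        raise ValueError("at least one model needs a positive weight")

    fused = defaultdict(float)
    for name, scores in predictions.items():
        for label, score in scores.items():
            fused[label] += weights[name] * score

    fused = {label: value / total_weight for label, value in fused.items()}
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Hypothetical scores for one chest X-ray (illustrative values, not results).
predictions = {
    "cxr_foundation": {"pleural_effusion": 0.80, "normal": 0.20},
    "medgemma": {"pleural_effusion": 0.60, "normal": 0.40},
}
weights = {"cxr_foundation": 0.6, "medgemma": 0.4}

label, fused = fuse_predictions(predictions, weights)
print(label, fused)
```

With these assumed inputs the specialist model's higher weight pulls the fused score for `pleural_effusion` toward its own estimate, which is the behavior weighted voting is meant to provide when one model is more trusted for a given modality.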

A key feature of the system is its strong privacy framework. The Privacy Guard module automatically removes patient identifiers, metadata, and embedded annotations from medical images, generates secure audit trails, and ensures that all computations remain local without transmitting data to external servers. This provides compliance with healthcare privacy standards while protecting sensitive patient information.
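
The metadata-stripping and audit-trail steps described above can be sketched as follows. This is a simplified illustration, not the Privacy Guard implementation: the header is modeled as a plain dictionary rather than a parsed DICOM file, `strip_phi` and `PHI_KEYS` are hypothetical names, and the attribute list is an illustrative subset rather than a complete de-identification profile. Removal of annotations burned into the pixel data is not shown.

```python
import hashlib
import json
from datetime import datetime, timezone

# Illustrative subset of identifying attributes (assumption: a real
# de-identification profile covers many more fields).
PHI_KEYS = {
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
}


def strip_phi(metadata, phi_keys=PHI_KEYS):
    """Return (clean_metadata, audit_entry) without modifying the input."""
    removed = sorted(key for key in metadata if key in phi_keys)
    clean = {key: value for key, value in metadata.items() if key not in phi_keys}
    audit_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "removed_fields": removed,  # field names only, never the values
        "clean_digest": hashlib.sha256(
            json.dumps(clean, sort_keys=True).encode("utf-8")
        ).hexdigest(),
    }
    return clean, audit_entry


# Hypothetical header values for illustration.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "Modality": "CR",
    "BodyPartExamined": "CHEST",
}
clean, audit = strip_phi(header)
print(clean)
print(audit["removed_fields"])
```

Logging only the names of removed fields, together with a digest of the cleaned header, lets an auditor confirm what was stripped without the audit trail itself becoming a store of patient identifiers.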

The platform supports specialized workflows for radiology, dermatology, and pathology, including disease detection, lesion classification, tissue analysis, temporal comparison of medical images, and differential diagnosis generation. To improve transparency, it incorporates Explainable AI (XAI) techniques such as Grad-CAM attention heatmaps, allowing clinicians to visualize the image regions influencing AI decisions.
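
For readers unfamiliar with Grad-CAM, the core computation is small enough to sketch in plain Python. The sketch assumes that the feature maps of one convolutional layer and the gradients of the class score with respect to those maps have already been extracted from the model; how MedGem AI obtains them is not described in the paper, and the function name `grad_cam` and the toy values are illustrative.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap.

    activations: K feature maps, each an H x W nested list (layer output).
    gradients:   K maps of the same shape, d(class score)/d(activation).
    Returns an H x W nested list with values in [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]

    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient map.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * act[i][j]

    # ReLU keeps only regions that push the class score up.
    heatmap = [[max(value, 0.0) for value in row] for row in heatmap]

    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Toy example: two 2 x 2 channels with made-up values.
activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam = grad_cam(activations, gradients)
print(cam)
```

In the toy example the second channel has a negative average gradient, so the locations where it is active are suppressed by the ReLU, and only the regions supported by the first channel remain in the heatmap.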

Experimental evaluations demonstrated successful detection and analysis of conditions such as pleural effusion, cardiomegaly, skin lesions, and colorectal cancer tissues. Hardware optimization enables deployment on systems with as little as 8 GB VRAM, making advanced AI-assisted diagnostics accessible outside high-performance computing environments.
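
A rough calculation shows why 4-bit quantization matters for an 8 GB VRAM budget. The sketch below estimates weight storage only; it assumes a nominal 4 billion parameters, a quantization block size of 64, and one 32-bit scaling constant per block, none of which are stated in the paper. Activations, the key-value cache, and any components kept in higher precision are not counted, so actual memory use will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight, block_size=None, constant_bits=0):
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes).

    If block_size is given, each block of weights also stores one scaling
    constant of constant_bits bits, as in blockwise quantization.
    """
    overhead = constant_bits / block_size if block_size else 0.0
    total_bits = num_params * (bits_per_weight + overhead)
    return total_bits / 8 / 1e9


NUM_PARAMS = 4e9  # assumption: nominal "4B" parameter count

fp16 = weight_memory_gb(NUM_PARAMS, 16)
nf4 = weight_memory_gb(NUM_PARAMS, 4, block_size=64, constant_bits=32)
print(f"16-bit weights: {fp16:.2f} GB")
print(f"4-bit weights:  {nf4:.2f} GB")
```

Under these assumptions, 16-bit weights alone would fill an 8 GB card, while the 4-bit weights take roughly a quarter of that, leaving headroom for activations and the specialized foundation models.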

Conclusion

The move from ResNet, DenseNet, and YOLO models to multimodal generative models such as MedGemma marks an essential paradigm change in computational medicine. Although legacy computer vision models proved able to recognize pathologies with outstanding precision, they did not provide enough contextual understanding or interpretability. MedGem AI addresses both issues by combining vision encoders with a language backbone. MedGemma thus becomes an intelligent system that generates structured reports clinicians can easily interpret, acting as their active assistant. The efficacy of MedGemma was validated using MedVista AI's examples of diagnosing complicated cases of musculoskeletal injuries and histopathology. Moreover, because the MedGemma models are optimized for consumer-grade devices through quantization, the system balances the power of modern machine learning with the necessity of keeping personal patient data on premises. In this way, MedGem AI offers an efficient response to the worldwide shortage of medical experts.

For future iterations, we aim to evolve MedGem AI into an integrated diagnostic system capable of maintaining an ongoing history of a patient's diseases or injuries. Research priorities include longitudinal analysis to monitor disease progression, tumor behavior, and wound healing dynamics. MedGemma's functionality will also extend to additional medical data modalities, including imaging, ultrasound videos, genomics, and lab tests. Additionally, we will explore sub-4-bit quantization and ultra-lightweight models for running advanced medical reasoning on consumer hardware. In terms of applications, MedGemma will evolve into an interactive tool that helps clinicians not only create structured reports but also consult on personalized treatment options cross-referenced against hospital protocols and worldwide resources.


Copyright

Copyright © 2026 Dr. U. M. Patil, Nomesh R. Kirange, Mansi S. Bendale, Tejas R. Jadhav, Krushna D. Patil. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET82787

Publish Date : 2026-05-19

ISSN : 2321-9653

Publisher Name : IJRASET
