Skin cancer is a global health issue for which early diagnosis significantly improves survival rates. Deep learning shows great promise, but most high-performance models require 28 to 88 million parameters, making them too heavy for deployment in remote clinics with limited resources. To address this, we present a novel training-time architectural augmentation framework for lightweight hybrid vision transformers that improves diagnostic performance without increasing the inference-time parameter count. We evaluated our approach on the highly imbalanced HAM10000 benchmark, achieving a competitive 88.32% overall accuracy while keeping the final model at only 2.9 million parameters. Although larger state-of-the-art architectures achieve higher accuracy, our lightweight configuration surpasses them in class-balanced metrics such as F1-score and recall with a far smaller parameter footprint. It also prevents majority classes from dominating the learning process, raising accuracy on rare lesions such as dermatofibroma (DF) and basal cell carcinoma (BCC) to 90.91% and 86.54%, respectively. Supported by Grad-CAM visualizations for transparency, this framework bridges the gap between balanced per-class performance and practical real-world deployment on edge devices.
Introduction
This paper presents a lightweight and efficient deep learning framework for skin cancer classification using dermoscopy images from the HAM10000 dataset. Existing high-performing models such as Vision Transformers and Swin Transformers achieve strong accuracy but require 86–88 million parameters, making them unsuitable for deployment on mobile, wearable, or edge medical devices. We address this challenge by combining MobileViTv2-0.75 (only 2.9 million parameters) with BatchFormer, a training-time enhancement that improves performance without increasing inference-time model size.
A major challenge in skin lesion classification is the severe class imbalance in the HAM10000 dataset, where Melanocytic Nevi (NV) account for 66.9% of images, while classes such as Dermatofibroma (DF) and Vascular lesions (VASC) are extremely rare. To address this, we employ extensive data augmentation, label smoothing, and macro-recall-based checkpoint selection to improve recognition of minority classes.
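The reason for selecting checkpoints by macro recall rather than accuracy can be illustrated with a minimal sketch. The helper names (`macro_recall`, `accuracy`, `best_checkpoint`) and the toy labels are hypothetical and are not taken from the paper's code; macro recall is assumed here to be the unweighted mean of per-class recall.

```python
from collections import Counter


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_checkpoint(history):
    """history: list of (epoch, y_true, y_pred); return the epoch with the highest macro recall."""
    return max(history, key=lambda item: macro_recall(item[1], item[2]))[0]


# Toy validation set: 8 majority-class samples and one sample each of two rare classes.
y_true = [0] * 8 + [1, 2]
majority_only = [0] * 10              # always predicts the majority class
balanced = [0] * 6 + [1, 2] + [1, 2]  # misses two majority samples, finds both rare ones

history = [(1, y_true, majority_only), (2, y_true, balanced)]
print(accuracy(y_true, majority_only), macro_recall(y_true, majority_only))
print(accuracy(y_true, balanced), macro_recall(y_true, balanced))
print(best_checkpoint(history))
```

Both toy checkpoints have the same accuracy (0.8), but their macro recall differs sharply (about 0.33 versus about 0.92), so accuracy alone could not tell them apart while macro recall selects the second one.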
The proposed architecture uses a pretrained MobileViTv2 backbone to extract image features, while BatchFormer applies self-attention across the samples in each training batch to enhance feature learning. Importantly, BatchFormer is removed during inference, allowing the deployed model to retain its compact 2.9M-parameter footprint. Three experimental configurations were evaluated: a baseline model and two BatchFormer-enhanced versions.
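The training-only role of BatchFormer described above can be sketched in PyTorch as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: `BatchFormerHead` is a hypothetical name, the input is assumed to be a pooled backbone feature vector per image, and the head count, feed-forward width, and dropout rate are placeholder values.

```python
import torch
import torch.nn as nn


class BatchFormerHead(nn.Module):
    """Classifier head with a training-only attention layer over the batch axis."""

    def __init__(self, feat_dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        # One Transformer encoder layer; all sizes are illustrative assumptions.
        self.batchformer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads,
            dim_feedforward=feat_dim, dropout=0.5,
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats, labels=None):
        if self.training and labels is not None:
            # (B, C) -> (B, 1, C): with the default batch_first=False, the layer
            # treats this as a length-B sequence, so attention runs across samples.
            mixed = self.batchformer(feats.unsqueeze(1)).squeeze(1)
            # The shared classifier sees both the original and the mixed features.
            feats = torch.cat([feats, mixed], dim=0)
            labels = torch.cat([labels, labels], dim=0)
        return self.classifier(feats), labels


head = BatchFormerHead(feat_dim=32, num_classes=7)
feats = torch.randn(8, 32)            # stand-in for pooled backbone features
labels = torch.randint(0, 7, (8,))

head.train()
train_logits, train_labels = head(feats, labels)   # batch is doubled during training

head.eval()
eval_logits, _ = head(feats)                       # attention layer is skipped
```

Because the attention layer is never called once `head.eval()` is set, its weights are not needed at inference and can be left out of the exported model, which is why the deployed parameter count does not grow.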
Results show that the proposed approach improves performance over the baseline. Accuracy increased from 84.93% in the baseline model to 88.32%, while precision, recall, and F1-score reached 83.75%, 82.98%, and 83.19%, respectively. Minority-class performance improved substantially, with DF classification accuracy increasing from 72.73% to 90.91%. We also found that larger batch sizes under BatchFormer further improved performance without increasing model complexity.
For interpretability, Grad-CAM visualizations were used to verify that the model focuses on clinically relevant lesion regions rather than background artifacts, which improves trustworthiness for medical applications. Compared with larger state-of-the-art models, the proposed framework achieves competitive classification performance with roughly 10 to 30 times fewer parameters (2.9 million versus 28 to 88 million), making it well suited to real-world deployment on resource-constrained healthcare devices.
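The Grad-CAM check can be sketched in PyTorch as below. This is a minimal sketch of the standard Grad-CAM computation (channel weights from spatially averaged gradients, followed by a ReLU over the weighted sum of activations), not the paper's visualization code. The `grad_cam` helper, the choice of target layer, and the tiny stand-in network are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_cam(model, layer, image, class_idx):
    """Return an (H, W) heatmap in [0, 1] for one image of shape (1, 3, H, W)."""
    store = {}
    # Capture the activations of the chosen layer during the forward pass.
    handle = layer.register_forward_hook(
        lambda module, inputs, output: store.update(act=output)
    )
    try:
        model.eval()
        logits = model(image)
    finally:
        handle.remove()
    act = store["act"]                                         # (1, K, h, w)
    # Gradient of the target class score with respect to the activations.
    grads = torch.autograd.grad(logits[0, class_idx], act)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)             # one weight per channel
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))     # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = cam[0, 0].detach()
    return cam / (cam.max() + 1e-8)


# Tiny stand-in classifier with 7 outputs (HAM10000 has seven lesion classes).
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 7),
)
heatmap = grad_cam(model, model[1], torch.randn(1, 3, 16, 16), class_idx=2)
```

For a real backbone, `layer` would be one of the last spatial feature stages of the network; which module that is depends on the model definition in use. The resulting heatmap can be overlaid on the dermoscopy image to check whether high values fall on the lesion or on background artifacts.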
Conclusion
In this paper, we introduced a novel training-time architectural augmentation framework for hybrid vision transformers that handles highly imbalanced dermoscopy data. This approach improves performance without increasing inference-time computational overhead. Specifically, we integrated MobileViTv2 with the BatchFormer module and evaluated the combination on the HAM10000 dataset. The compact design of MobileViTv2 limits its baseline performance, but our approach counters this with strong data augmentation and BatchFormer training, resulting in a capable diagnostic tool that stays within a 2.9 million parameter footprint.
The proposed setup reached a final accuracy of 88.32%, with F1-score, recall, and precision of 83.19%, 82.98%, and 83.75%, respectively. By prioritizing balanced performance across all lesion classes, our model beats state-of-the-art architectures in F1-score, recall, and precision with only a small fraction of their parameter count. Overall, these results show that very large models are not required for precise and explainable skin cancer identification. Future work will explore further batch size scaling, test more advanced augmentation strategies, and expand evaluation to ISIC 2019 and ISIC 2020.
References
[1] R. L. Siegel, K. D. Miller, H. E. Fuchs, and A. Jemal, “Cancer statistics, 2021,” CA: A Cancer Journal for Clinicians, vol. 71, no. 1, pp. 7–33, Jan. 2021, doi: 10.3322/caac.21654.
[2] J. Kawahara, A. BenTaieb, and G. Hamarneh, “Deep features to classify skin lesions,” in Proc. IEEE ISBI, 2016, pp. 1397–1400, doi: 10.1109/ISBI.2016.7493528.
[3] M. S. I. Sajol, S. T. Alvi, and C. A. A. Era, “Performance assessment of advanced CNN and transformer architectures in skin cancer detection,” in Proc. 11th Int. Conf. on EECSI, 2024, pp. 1–8, doi: 10.1109/EECSI63442.2024.10776508.
[4] P. Tschandl, C. Rosendahl, and H. Kittler, “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Scientific Data, vol. 5, Art. no. 180161, Aug. 2018, doi: 10.1038/sdata.2018.161.
[5] N. C. F. Codella et al., “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the ISIC,” 2019. [Online]. Available: https://arxiv.org/abs/1902.03368
[6] S. Mehta and M. Rastegari, “MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer,” in Proc. ICLR, 2022. [Online]. Available: https://openreview.net/forum?id=vh-0sUt8HlG
[7] S. Mehta and M. Rastegari, “Separable self-attention for mobile vision transformers,” 2022. [Online]. Available: https://arxiv.org/abs/2206.02680
[8] Z. Hou, B. Yu, and D. Tao, “BatchFormer: Learning to explore sample relationships for robust representation learning,” in Proc. IEEE CVPR, 2022, pp. 7246–7256, doi: 10.1109/CVPR52688.2022.00711.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002, doi: 10.1613/jair.953.
[10] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” Neural Networks, vol. 106, pp. 249–259, 2018, doi: 10.1016/j.neunet.2018.07.011.
[11] A. Mumuni and F. Mumuni, “Data augmentation: A comprehensive survey of modern approaches,” Array, vol. 16, Art. no. 100258, 2022, doi: 10.1016/j.array.2022.100258.
[12] R. R. Selvaraju et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE ICCV, 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74.
[13] R. Wightman, “PyTorch image models (timm).” [Online]. Available: https://github.com/rwightman/pytorch-image-models. [Accessed Mar. 2026].
[14] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019, doi: 10.48550/arXiv.1711.05101.
[15] S. Smith, P. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” in Proc. ICLR, 2018. [Online]. Available: https://openreview.net/pdf?id=B1Yy1BxCZ