Abstract
Skin cancer remains one of the most prevalent and rapidly rising malignancies worldwide, underscoring the need for accurate and early detection through automated diagnostic tools. Deep learning has significantly advanced dermoscopic image analysis, with Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) emerging as the two dominant paradigms. CNNs excel at capturing fine-grained local texture patterns such as pigment networks, color variations, and border irregularities, while ViTs leverage self-attention mechanisms to model long-range global dependencies and holistic lesion structure. This review provides an in-depth examination of CNN-based and ViT-based skin cancer classifiers, discussing their architectural principles, feature extraction capabilities, performance trends, and suitability for real-world clinical settings. We analyze key publicly available skin cancer datasets, preprocessing pipelines, training strategies, and evaluation metrics commonly used with these models. Furthermore, we highlight the complementary strengths of CNNs and ViTs, assess recent hybrid architectures that integrate local and global feature learning, and discuss challenges related to data imbalance, domain variability, computational cost, and model interpretability. The review concludes by outlining future research opportunities toward developing robust, transparent, and clinically reliable AI systems using CNN, ViT, and hybrid approaches for skin cancer diagnosis.
Introduction
Skin cancer is one of the fastest-rising cancers worldwide, with millions of new cases reported annually. While basal cell carcinoma (BCC) and squamous cell carcinoma (SCC) are the most common types, melanoma—though less frequent—causes the majority of skin cancer deaths due to its aggressive metastasis. Increasing UV exposure, lifestyle changes, genetic risks, and limited access to screening contribute to the growing global burden. Early detection is critical, yet traditional dermoscopic diagnosis is subjective and depends heavily on clinician expertise, often leading to misdiagnosis and inconsistent evaluations. The shortage of dermatologists further highlights the need for scalable, objective diagnostic tools.
Deep learning has emerged as a transformative solution in automated skin cancer classification. Convolutional Neural Networks (CNNs) effectively extract local visual patterns such as textures, edges, and color variations, while Vision Transformers (ViTs) capture global structural relationships through self-attention mechanisms. Hybrid CNN-Transformer architectures, self-supervised learning, and multimodal models integrating metadata or clinical notes have pushed performance toward dermatologist-level accuracy. However, challenges remain, including dataset imbalance, domain shift, limited explainability, and barriers to clinical deployment.
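To make the ViT side of this contrast concrete, the sketch below shows the two core operations behind patch-based self-attention: splitting an image into non-overlapping patch tokens and letting every token attend to every other token, which is what gives ViTs their global receptive field. This is a minimal PyTorch sketch rather than any specific published model; the patch size, embedding width, and head count are illustrative assumptions.

```python
# Minimal sketch of ViT-style patch embedding plus global self-attention
# (PyTorch). Patch size, embedding width, and head count are illustrative.
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    def __init__(self, patch=16, dim=192, heads=3):
        super().__init__()
        # Non-overlapping patches via a strided convolution, as in ViT.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.embed(x)                      # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        # Every patch attends to every other patch: global lesion context.
        out, weights = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out), weights

x = torch.randn(2, 3, 224, 224)
feats, attn_map = PatchSelfAttention()(x)
print(feats.shape, attn_map.shape)  # (2, 196, 192) and (2, 196, 196)
```

The returned attention weights are a 196 x 196 patch-to-patch affinity map per image, which is also a natural starting point for attention-based visual explanations.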
This review then covers fundamental skin cancer characteristics, clinical diagnostic rules (such as the ABCD rule and the seven-point checklist), and the major dermoscopic datasets that fuel deep learning research: the ISIC Archive and its challenge datasets (2017–2019), HAM10000, PH2, Derm7pt, and MED-NODE. These datasets vary in size, diversity, annotation richness, and suitability for tasks such as segmentation, lesion classification, and clinical interpretability.
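As a concrete example of working with these resources, the snippet below inspects the class distribution in the HAM10000 metadata with pandas. The CSV filename and the 'dx' diagnosis column follow the commonly distributed release of the dataset, but should be verified against the actual downloaded archive.

```python
# Sketch: inspecting class balance in the HAM10000 metadata with pandas.
# The CSV filename and the 'dx' (diagnosis) column follow the common public
# release; verify both against your downloaded copy.
import pandas as pd

meta = pd.read_csv("HAM10000_metadata.csv")
counts = meta["dx"].value_counts()   # diagnoses: nv, mel, bkl, bcc, akiec, vasc, df
print(counts)
print("imbalance ratio:", counts.max() / counts.min())
```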
Key preprocessing techniques (color correction, noise reduction, artifact removal), data augmentation strategies (geometric transformations, color shifts, MixUp, CutMix, and GAN-based image synthesis), and lesion segmentation methods (U-Net, ResUNet, transformer-based models) are outlined as essential steps for improving model robustness and generalization.
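Among the augmentation strategies listed, MixUp is compact enough to sketch in full. The snippet below follows the standard formulation (convex combinations of image pairs and of their labels); the alpha value is a tunable hyperparameter, not a recommendation from this review.

```python
# Sketch of MixUp augmentation: a convex combination of image pairs, with the
# loss computed as the same combination of the two label sets.
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    """Mix each image in the batch with a randomly permuted partner."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    # Interpolate the losses against both original label sets.
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```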
A detailed comparison shows that CNNs excel at capturing fine-grained local features, while Vision Transformers provide superior modeling of global lesion structure. Each approach has its own strengths and limitations, and current research increasingly combines them to exploit their complementary advantages. Recent studies report strong performance using various attention mechanisms, multi-scale feature extraction, and class-imbalance handling strategies; a common example of the latter is sketched below.
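A minimal sketch of one such class-imbalance strategy is inverse-frequency class weighting fed to a weighted cross-entropy loss in PyTorch. The class counts here are illustrative (they mirror the commonly reported HAM10000 diagnosis counts) and should be computed from the actual training split.

```python
# Sketch: inverse-frequency class weights for an imbalanced lesion dataset,
# passed to PyTorch's weighted cross-entropy. Counts below are illustrative
# (commonly reported HAM10000 diagnosis counts); compute them from your split.
import torch
import torch.nn as nn

class_counts = torch.tensor([6705., 1113., 1099., 514., 327., 142., 115.])
weights = class_counts.sum() / (len(class_counts) * class_counts)  # rare classes weigh more
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 7)               # dummy batch of 8 over 7 lesion classes
labels = torch.randint(0, 7, (8,))
loss = criterion(logits, labels)
```

Weighting the loss this way penalizes errors on rare classes (such as dermatofibroma) more heavily, counteracting the dominance of benign nevi in the training signal.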
Conclusion
Skin cancer remains a major global health challenge, and early detection is crucial for improving survival outcomes, particularly for aggressive forms such as melanoma. Deep learning has transformed the landscape of dermoscopic image analysis by enabling automated, accurate, and scalable diagnostic systems that increasingly approach dermatologist-level performance.
Through this review, we summarized key skin cancer types, widely used datasets, preprocessing pipelines, and state-of-the-art DL architectures ranging from CNNs and Vision Transformers to hybrid, multimodal, and self-supervised frameworks. While significant progress has been made, real-world deployment is still hindered by issues such as class imbalance, limited skin tone diversity, domain shift, explainability gaps, and ethical concerns. Future advancements such as lightweight mobile AI, federated learning, multimodal fusion, diffusion-based augmentation, improved interpretability, and large dermatology foundation models hold considerable promise for bridging the gap between research and clinical practice. Ultimately, the development of robust, transparent, and equitable AI systems will be essential to support dermatologists, enhance early diagnosis, and enable widespread, accessible skin cancer screening across diverse populations.