Skin cancer incidence continues to rise globally, with melanoma presenting significant mortality risks if not detected early. Early and accurate diagnosis is critical for effective treatment, yet conventional dermoscopic assessment remains subjective and prone to inter-observer variability. To address these challenges, this article proposes HyCoT-Net, a novel hybrid deep learning framework that integrates a CNN-based Local Texture Encoder (LTE) and a Transformer-based Global Context Encoder (GCE) through an Adaptive Fusion Module (AFM). The LTE captures fine-grained morphological features such as pigment networks, dots, globules, and streaks, while the GCE models long-range dependencies and global lesion structure. The AFM dynamically learns per-image importance weights, adaptively balancing the contributions of local and global representations according to lesion characteristics. This approach enables the network to effectively handle the high intra-class variability and inter-class similarity commonly present in dermoscopic images. HyCoT-Net was evaluated on the ISIC 2019 dataset, containing 25,331 dermoscopic images across eight clinically significant lesion categories. Extensive experiments demonstrate that the proposed model outperforms state-of-the-art CNNs, Transformers, and conventional hybrid methods, achieving an accuracy of 95.35%. Grad-CAM++ visualization further confirms the model’s ability to selectively focus on clinically relevant regions, enhancing interpretability. The results indicate that adaptive feature fusion provides robust and generalizable representations, improving classification reliability for automated skin cancer screening. Overall, HyCoT-Net presents a promising tool for supporting dermatologists in early detection, offering both high predictive performance and clinical relevance.
1. Introduction
Skin cancer is one of the fastest-growing malignancies worldwide, with melanoma being particularly dangerous due to its high metastatic potential and mortality rate when detected late. While non-melanoma cancers such as basal cell carcinoma and squamous cell carcinoma are more common, early detection remains critical for all types to improve survival rates and reduce invasive treatments.
Diagnosis primarily relies on:
Visual inspection
Dermoscopy (a non-invasive imaging technique that enhances the visibility of lesion structures)
Although dermoscopy improves diagnostic accuracy, it remains subjective and dependent on clinician expertise. Inter-observer variability, artifacts (e.g., hairs, air bubbles), and overlapping visual features between benign and malignant lesions highlight the need for reliable Computer-Aided Diagnosis (CAD) systems.
2. Challenges in Automated Skin Lesion Classification
The ISIC 2019 dataset provides a large benchmark for automated skin cancer detection, containing 25,331 dermoscopic images across eight lesion categories. However, classification is difficult due to:
High intra-class variability (lesions of the same class can look very different)
High inter-class similarity (lesions of different classes can look alike)
Variations in color, texture, size, shape, and imaging conditions
Presence of artifacts
Traditional CNNs capture local texture features well but struggle with global structural patterns (e.g., asymmetry, border irregularity). Vision Transformers (ViTs) model long-range dependencies effectively but:
Require large datasets
May miss subtle local features
Are sensitive to noise
Existing hybrid models combine CNNs and Transformers but often use simple feature concatenation, failing to adapt dynamically to lesion-specific characteristics.
3. Proposed Solution: HyCoT-Net
The study introduces HyCoT-Net (Hybrid CNN-Transformer Network), a novel architecture designed to dynamically balance local and global feature extraction.
Architecture Components
1. Local Texture Encoder (LTE)
Based on MobileNetV2 (CNN)
Extracts fine-grained features:
Pigment networks
Dots
Globules
Streaks
Focuses on high-frequency local patterns
2. Global Context Encoder (GCE)
Based on MobileViT (Transformer)
Captures:
Long-range dependencies
Lesion asymmetry
Border irregularity
Global color distribution
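As a concrete reference point, below is a minimal sketch of the two encoder branches, assuming PyTorch with torchvision (MobileNetV2) and timm (MobileViT). The specific checkpoints, truncation points, and output shapes are illustrative; the source does not give the exact backbone configuration.

```python
# Minimal sketch of the two encoder branches (PyTorch; torchvision + timm
# assumed available). Layer choices are illustrative, not the authors'
# exact configuration.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
import timm

class LocalTextureEncoder(nn.Module):
    """CNN branch (MobileNetV2): fine-grained, high-frequency texture features."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features  # (B, 1280, 7, 7) for 224x224 input
        self.out_channels = 1280

    def forward(self, x):
        return self.features(x)

class GlobalContextEncoder(nn.Module):
    """Transformer branch (MobileViT): long-range dependencies, global structure."""
    def __init__(self):
        super().__init__()
        # features_only exposes intermediate feature maps instead of logits
        self.backbone = timm.create_model(
            "mobilevit_s", pretrained=True, features_only=True
        )
        self.out_channels = self.backbone.feature_info.channels()[-1]

    def forward(self, x):
        return self.backbone(x)[-1]  # deepest feature map
```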
3. Adaptive Fusion Module (AFM)
The AFM is the key innovation of the model.
Unlike fixed fusion strategies, AFM:
Learns per-image importance weights
Dynamically determines how much the model should rely on:
CNN (local texture features)
Transformer (global context features)
Uses a gating mechanism with sigmoid activation to generate fusion weights
This enables flexible, lesion-specific decision-making.
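The source describes the AFM only at this level of detail, so the following is a minimal sketch of one plausible realization: globally pooled statistics from both branches drive a sigmoid gate that outputs a per-image weight alpha, and the fused map is the convex combination alpha * local + (1 - alpha) * global. The 1×1 projections, hidden width, and scalar (rather than channel-wise) gate are assumptions, not reported design choices.

```python
import torch.nn.functional as F

class AdaptiveFusionModule(nn.Module):
    """Per-image gated fusion of local (CNN) and global (Transformer) features."""
    def __init__(self, local_dim, global_dim, fused_dim=512):
        super().__init__()
        # 1x1 convs project both branches to a shared channel width
        self.local_proj = nn.Conv2d(local_dim, fused_dim, kernel_size=1)
        self.global_proj = nn.Conv2d(global_dim, fused_dim, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(inplace=True),
            nn.Linear(fused_dim, 1),
            nn.Sigmoid(),  # alpha in (0, 1): learned per-image fusion weight
        )

    def forward(self, f_local, f_global):
        l = self.local_proj(f_local)
        g = self.global_proj(f_global)
        # align spatial sizes in case the two backbones disagree
        if g.shape[-2:] != l.shape[-2:]:
            g = F.interpolate(g, size=l.shape[-2:], mode="bilinear",
                              align_corners=False)
        # per-image descriptor from globally pooled branch statistics
        desc = torch.cat([l.mean(dim=(2, 3)), g.mean(dim=(2, 3))], dim=1)
        alpha = self.gate(desc).view(-1, 1, 1, 1)
        # adaptive balance: alpha -> local texture, (1 - alpha) -> global context
        return alpha * l + (1.0 - alpha) * g
```

A channel-wise gate (one alpha per channel) would be a natural variant; the scalar form above matches the per-image weighting described in the text.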
4. Classification Process
After fusion:
Global Average Pooling (GAP) reduces spatial dimensions
Fully connected layers with dropout improve generalization
Softmax classifier predicts probabilities across 8 classes
Categorical cross-entropy loss is used for training
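Putting the pieces together, a minimal end-to-end sketch of this classification path might look as follows; the hidden width and dropout rate are assumed values, and the softmax is folded into PyTorch's cross-entropy loss rather than applied explicitly.

```python
class HyCoTNet(nn.Module):
    """Sketch of the full pipeline: LTE + GCE -> AFM -> GAP -> FC head."""
    def __init__(self, num_classes=8, fused_dim=512):
        super().__init__()
        self.lte = LocalTextureEncoder()
        self.gce = GlobalContextEncoder()
        self.afm = AdaptiveFusionModule(local_dim=self.lte.out_channels,
                                        global_dim=self.gce.out_channels,
                                        fused_dim=fused_dim)
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 256),   # hidden width assumed
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),           # dropout rate assumed
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        fused = self.afm(self.lte(x), self.gce(x))
        pooled = fused.mean(dim=(2, 3))  # Global Average Pooling
        return self.head(pooled)         # logits; softmax folded into the loss

# categorical cross-entropy (applies log-softmax to the logits internally)
criterion = nn.CrossEntropyLoss()
```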
5. Experimental Setup
Dataset: ISIC 2019
25,331 dermoscopic images
8 categories:
Melanoma (MEL)
Nevus (NV)
Basal cell carcinoma (BCC)
Actinic keratosis (AK)
Benign keratosis (BKL)
Dermatofibroma (DF)
Vascular lesions (VASC)
Squamous cell carcinoma (SCC)
Training Details
Image size: 224×224
Data augmentation: rotation, flips, brightness, zoom, color jitter
Optimizer: Adam (learning rate 1e−4)
Cosine annealing scheduler
Batch size: 32
Early stopping
ImageNet pretrained weights
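The listed settings translate into roughly the following setup; augmentation magnitudes and the cosine schedule horizon (T_max) are not reported, so the values below are assumptions.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# augmentation pipeline covering the listed transforms (magnitudes assumed)
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # zoom
    transforms.RandomRotation(degrees=30),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),     # brightness / color jitter
    transforms.ToTensor(),
])

model = HyCoTNet()  # backbones load ImageNet pretrained weights
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=50)  # epoch budget assumed
# batch_size=32 in the DataLoader; early stopping monitors validation loss
```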
Evaluation metrics:
Accuracy
Precision
Recall
F1-score
ROC-AUC
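These metrics can be computed from the model's predictions with scikit-learn, for example; macro averaging and one-vs-rest AUC are assumptions here, as the source does not state the averaging scheme.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: class indices; y_prob: (N, 8) softmax probabilities."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "roc_auc":   roc_auc_score(y_true, y_prob, multi_class="ovr",
                                   average="macro"),
    }
```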
6. Results
HyCoT-Net outperformed competing models:
Model                     Accuracy
HybridSkinFormer          94.2%
Hybrid ConvNeXtV2         93.48%
HyCoT-Net (Proposed)      95.35%
Key findings:
Consistent improvements across all evaluation metrics
Better handling of intra-class variability and inter-class similarity
Enhanced interpretability using Grad-CAM++ visualizations (see the sketch after this list)
Improved robustness and generalization
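Heatmaps of the kind the study reports can be produced with the open-source pytorch-grad-cam package; the choice of target layer below (the last convolutional block of the CNN branch) and the class index are assumptions, not details given in the source.

```python
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# target layer is an assumption: last conv block of the MobileNetV2 branch
cam = GradCAMPlusPlus(model=model, target_layers=[model.lte.features[-1]])

# input_tensor: (1, 3, 224, 224) normalized image; rgb_img: float HxWx3 in [0, 1]
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # class index assumed
overlay = show_cam_on_image(rgb_img, heatmap, use_rgb=True)
```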
7. Key Contributions
Novel hybrid CNN-Transformer architecture
Adaptive Fusion Module (AFM) for dynamic feature weighting
Improved classification accuracy over state-of-the-art models
Enhanced interpretability and clinical relevance
8. Conclusion
In summary, the experimental analysis confirms that the proposed HyCoT-Net outperforms the comparative networks, achieving the highest accuracy of 95.35% on the ISIC 2019 dataset. The superior performance underscores the effectiveness of the adaptive fusion strategy in leveraging complementary local and global features for robust classification. By demonstrating both high predictive accuracy and reliable generalization across complex dermoscopic images, HyCoT-Net presents a promising framework for automated skin cancer screening. Its adaptive and interpretable design offers practical utility for clinical applications, providing a step toward more accurate and efficient early detection of malignant lesions.