Interstitial Lung Disease (ILD) is a group of progressive pulmonary conditions that damage lung tissue structure and gradually diminish respiratory function. Early identification of ILD remains difficult because the radiological differences are subtle and may be confused with other respiratory diseases. In recent years, artificial intelligence methods for automated medical image analysis have created new opportunities to improve diagnostic accuracy. This study develops a hybrid deep learning framework based on Vision Transformer (ViT) and Convolutional Neural Network (CNN) architectures for the early detection of ILD from chest CT scans and X-rays. The CNN component extracts fine-scale local spatial features, including texture anomalies and fibrotic structures, while the Vision Transformer captures global contextual relationships between lung regions through self-attention. The extracted representations are combined into an integrated prediction model that differentiates lungs with ILD from normal ones. A web-based clinical support system is created to enable real-time prediction, allowing medical practitioners to upload imaging data and receive automated diagnostic information. In the experiments, the proposed hybrid architecture classifies better than the single-model solutions, especially in detecting early-stage abnormalities.
Introduction
This paper proposes a hybrid Vision Transformer–Convolutional Neural Network (ViT-CNN) framework for the early detection of Interstitial Lung Disease (ILD) using chest CT scans and X-ray images. ILD is a group of lung disorders characterized by inflammation and fibrosis that can lead to respiratory failure if not diagnosed early. Conventional diagnosis relies on manual interpretation of medical images, which is time-consuming, requires expert radiologists, and is prone to inter-observer variability, often missing subtle abnormalities in the early stages. While CNNs effectively extract local image features, they have limited ability to capture long-range spatial relationships, whereas Vision Transformers (ViTs) excel at learning global contextual information. The proposed hybrid model combines the strengths of both architectures to improve diagnostic accuracy and support clinicians with faster and more reliable decision-making.
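To illustrate why a transformer can relate distant lung regions while a convolution only sees a local neighbourhood, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention over image patches. It is not the implementation used in this work: the patch count, embedding size, and random projection matrices `w_q`, `w_k`, `w_v` are assumptions chosen only for illustration.

```python
import numpy as np


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (num_patches, dim) array of patch embeddings.
    Returns the attended tokens and the (num_patches, num_patches)
    attention matrix, where row i holds patch i's weights over all patches.
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum for numerical stability before the softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical sizes: 16 patches, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
tokens = rng.normal(size=(num_patches, dim))
w_q = rng.normal(size=(dim, dim))
w_k = rng.normal(size=(dim, dim))
w_v = rng.normal(size=(dim, dim))

attended, attention = self_attention(tokens, w_q, w_k, w_v)
```

Every row of `attention` spans all patches, so each output token can draw on any region of the image in a single layer, whereas a convolutional layer mixes only the pixels under its kernel.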
The system uses publicly available CT and X-ray datasets containing both healthy and ILD cases. Images undergo preprocessing steps including resizing, normalization, noise reduction, lung region enhancement, and data augmentation to improve model robustness and reduce overfitting. The CNN branch extracts local texture features such as fibrosis and ground-glass opacities, while the ViT branch captures global structural relationships across the lungs. The extracted features are fused and processed through fully connected layers to classify the presence and severity of ILD. The model is implemented using TensorFlow or PyTorch with supporting tools such as OpenCV and scikit-learn, and is deployed through a Flask or FastAPI backend with a React-based web interface, enabling clinicians to upload medical images and receive real-time predictions along with explainable Grad-CAM heatmaps.
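The preprocessing and fusion steps described above can be sketched as follows. This is a framework-agnostic NumPy illustration under stated assumptions, not the system's actual code: the feature sizes (128 and 64), the two-class output, the untrained classifier weights, and the random vectors standing in for the CNN and ViT branch outputs are all hypothetical.

```python
import numpy as np


def normalize(image):
    """Min-max scale pixel intensities to the range [0, 1]."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def augment(image, rng):
    """Apply one simple augmentation: a random horizontal flip."""
    return image[:, ::-1] if rng.random() < 0.5 else image


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_and_classify(cnn_feat, vit_feat, weights, bias):
    """Concatenate both branch features and apply one dense layer."""
    fused = np.concatenate([cnn_feat, vit_feat], axis=-1)
    return softmax(fused @ weights + bias)


rng = np.random.default_rng(0)

# Stand-in for a 224x224 grayscale scan (random pixels, not real data).
image = rng.integers(0, 256, size=(224, 224))
prepared = augment(normalize(image), rng)

# Placeholders for the branch outputs; a real system would compute these
# from `prepared` with trained CNN and ViT networks.
cnn_feat = rng.normal(size=(1, 128))
vit_feat = rng.normal(size=(1, 64))

# Untrained classifier head for two assumed classes (normal, ILD).
weights = rng.normal(scale=0.01, size=(128 + 64, 2))
bias = np.zeros(2)

probs = fuse_and_classify(cnn_feat, vit_feat, weights, bias)
```

Concatenation followed by dense layers is the simplest form of late fusion. The steps omitted here (resizing, noise reduction, lung region enhancement) would typically be handled by an imaging library such as OpenCV.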
Experimental results demonstrate that the hybrid ViT-CNN model outperforms standalone CNN and ViT models in terms of accuracy, precision, recall, F1-score, and ROC-AUC. It achieves higher sensitivity for early-stage ILD detection, reducing false negatives and enabling earlier treatment. Visualization techniques such as Grad-CAM and attention maps confirm that the model focuses on clinically relevant lung regions, improving interpretability and clinician trust. The system also shows stable training, strong generalization to unseen data, and fast inference times suitable for real-world clinical deployment. Overall, the proposed framework provides an accurate, scalable, and explainable AI-based decision support system for early ILD diagnosis, with the potential to enhance patient outcomes and reduce diagnostic variability.
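For reference, the evaluation metrics named above can be computed as in the following pure-Python sketch. The labels and scores are made-up toy values used only to exercise the functions; they are not results from this study, and the 0.5 decision threshold is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = ILD)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,  # also called sensitivity
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive case is scored
    above a random negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Recall is the quantity most directly tied to missed early-stage cases: each false negative lowers it, which is why higher sensitivity corresponds to fewer missed diagnoses.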
Conclusion
By integrating CNN-based local feature extraction with the global contextual learning of transformer models, the proposed methodology shows better diagnostic performance than traditional single-model methods. The hybrid design improves the detection of faint lung anomalies, especially in the early stages of disease, when the visual patterns can be very hard to discern during manual assessment. The system also serves as an intelligent clinical decision-support tool that delivers prompt, reliable, and interpretable predictions through a web-based interface, helping healthcare professionals improve diagnostic effectiveness and reduce analysis time. The experimental assessment demonstrates the efficiency, robustness, and practicality of transformer-based hybrid learning for medical image analysis.
Despite these positive outcomes, several opportunities for improvement remain. Future research will extend the model to process full 3D CT volumes, in order to capture more spatial information and detect complex ILD patterns more accurately. Improved attention-based fusion methods can be investigated to achieve better feature integration between the CNN and transformer components. Enlarging the dataset through multi-institutional cooperation will help the model generalize to a wider range of patients and imaging conditions. The framework can also be expanded to multi-disease classification to identify other pulmonary diseases such as pneumonia, COPD, and lung cancer. Additional studies could include direct validation in real-time clinical settings, lightweight model optimization for edge deployment, and integration with electronic health records to support continuous and scalable AI-assisted healthcare.