Semantic segmentation labels an image pixel by pixel and has many applications in satellite image interpretation, environmental monitoring, and city planning. However, segmenting high-resolution images is difficult because scenes differ in complexity and scale. This paper compares two popular deep learning models, U-Net and DeepLabV3+, on a binary semantic segmentation task based on high-resolution satellite images. U-Net uses a symmetrical encoder-decoder structure with skip connections, whereas DeepLabV3+ uses atrous convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale contextual features. Both models are benchmarked on several metrics, including accuracy and Intersection over Union (IoU), for the binary task (vegetation/non-vegetation) on the DeepGlobe dataset. According to the findings, DeepLabV3+ outperforms U-Net, achieving higher IoU and more consistent segmentation.
Introduction
Semantic segmentation assigns a label to each pixel in an image and is widely used in applications such as remote sensing, medical imaging, and autonomous systems. Satellite image analysis is challenging due to high resolution, varying object scales, occlusions, and complex backgrounds. The study focuses on a binary classification problem (vegetation vs non-vegetation) using the DeepGlobe dataset.
Traditional methods are less effective on such imagery, which has led to the adoption of deep learning approaches. U-Net's encoder-decoder structure with skip connections helps it capture fine spatial details, but the model struggles with global context. DeepLabV3+ improves performance by using atrous (dilated) convolutions and the ASPP module, which allow better multi-scale feature extraction and a stronger grasp of global context.
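To illustrate why atrous convolution widens the context a filter sees, the following is a minimal one-dimensional sketch in plain Python. It is not the DeepLabV3+ implementation: the function names (`effective_kernel_size`, `atrous_conv1d`, `multi_rate_branches`) and the rates (1, 2, 4) are assumptions chosen for illustration. Real ASPP also uses separately learned 2-D kernels per branch and fuses the branch outputs.

```python
def effective_kernel_size(kernel_size, rate):
    """Number of input samples spanned by a kernel whose taps are `rate` apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous (dilated) convolution, without kernel flipping."""
    span = effective_kernel_size(len(kernel), rate)
    output = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `rate` samples after the previous tap.
        output.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return output


def multi_rate_branches(signal, kernel, rates=(1, 2, 4)):
    """Toy stand-in for ASPP's parallel branches: one kernel applied at several rates."""
    return {rate: atrous_conv1d(signal, kernel, rate) for rate in rates}


if __name__ == "__main__":
    signal = list(range(10))
    kernel = [1, 1, 1]
    for rate, response in multi_rate_branches(signal, kernel).items():
        print(rate, effective_kernel_size(len(kernel), rate), response)
```

With a fixed three-tap kernel, raising the rate from 1 to 4 grows the span from 3 to 9 input samples without adding any weights, which is the property the ASPP module exploits to gather context at several scales.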
The methodology includes dataset preprocessing (resizing, normalization, augmentation) and training both models under the same conditions using loss functions like categorical cross-entropy and optimizers like Adam. Performance is evaluated using accuracy and Intersection over Union (IoU).
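The two evaluation metrics can be sketched for a single pair of binary masks as follows. This is an illustrative example in plain Python, not the evaluation code used in the study: the helper names, the tiny 2x4 masks, and the convention of returning an IoU of 1.0 when both masks are empty are all assumptions.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two same-shaped binary masks (lists of 0/1 rows)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    return (tp + tn) / (tp + fp + fn + tn)


def iou(pred, target):
    """Intersection over Union for the positive class: TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / union if union else 1.0


if __name__ == "__main__":
    predicted = [[1, 1, 0, 0],
                 [1, 0, 0, 0]]
    ground_truth = [[1, 0, 0, 0],
                    [1, 1, 0, 0]]
    print(pixel_accuracy(predicted, ground_truth))  # 0.75
    print(iou(predicted, ground_truth))             # 0.5
```

In this toy case the accuracy (0.75) exceeds the IoU (0.5) because correctly labelled background pixels raise accuracy but do not enter the IoU, which is one reason to report both metrics.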
Conclusion
This study presented a comparative analysis of the U-Net and DeepLabV3+ networks for high-resolution semantic image segmentation. Both models were evaluated qualitatively and quantitatively, using accuracy and IoU as metrics.
Based on the experimental results, DeepLabV3+ performs better, especially in complex scenes. Because it uses atrous convolution and the ASPP module, the network captures multi-scale information, which improves object recognition and boundary detection. U-Net, by contrast, captures fine-grained detail well but fails to capture global context, which leads to fragmented results in difficult conditions.
Thus, DeepLabV3+ is the preferable choice for complex high-resolution tasks, whereas U-Net is more advantageous from a computational perspective. Future work could concentrate on developing hybrid architectures that combine the two techniques. Existing approaches could also be optimized, for example with new loss functions or attention modules.