To enhance the feature extraction capability and computational efficiency of Convolutional Neural Networks (CNNs) in image classification, this paper proposes a novel attention-augmented architecture, SGFA-ConvNeXt, built on the ConvNeXt backbone. The model embeds Spatial Gated Fusion Attention (SGFA) modules at the transition points of each stage. These modules adopt a dual-branch parallel structure that models salient features along the spatial and channel dimensions. The spatial branch combines multi-scale pooling with depthwise convolution to capture long-range dependencies, while the channel branch uses global average pooling to recalibrate channel weights and refine features. The two branches are then fused through a gating mechanism and a residual connection, enhancing representational capacity while preserving gradient stability. Experiments on the CIFAR-10 dataset show that SGFA-ConvNeXt improves classification accuracy by over 2% compared with the ConvNeXt-Tiny baseline, with only a marginal increase in FLOPs, and that it performs competitively against other advanced CNN architectures. Ablation studies further confirm that the spatial and channel attention paths in SGFA are complementary, and that the module improves performance at low computational cost. The method thus offers a novel design strategy for efficient image classification in resource-constrained scenarios.
Introduction
Image classification, a key task in computer vision, has advanced greatly with deep learning, especially with CNNs such as ConvNeXt. However, challenges remain, including high computational cost, a limited ability to capture global and local features simultaneously, and inconsistent sensitivity to diverse features.
To address these issues, this study enhances ConvNeXt with a Spatial Gated Fusion Attention (SGFA) module that combines spatial and channel attention mechanisms to improve feature representation while keeping computation efficient. SGFA is built from depthwise separable convolutions and layer normalization, and is embedded at key stages of the network to boost classification accuracy.
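The paper's implementation of SGFA is not reproduced here; the following PyTorch sketch is one plausible reading of the description above. The branch layout, pooling kernel sizes, reduction ratio, and the exact gating formula are assumptions for illustration, not the paper's verified design:

```python
import torch
import torch.nn as nn

class SGFA(nn.Module):
    """Hypothetical sketch of a Spatial Gated Fusion Attention block."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # Channel branch: SE-style recalibration from global average pooling.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: multi-scale average pooling, then a depthwise
        # separable convolution to aggregate long-range spatial context.
        self.pools = nn.ModuleList(
            nn.AvgPool2d(k, stride=1, padding=k // 2) for k in (3, 5, 7)
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        # 1x1 gate that weighs the two branches at each position.
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm over the channel dimension (NCHW -> NHWC -> NCHW).
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        s = sum(p(y) for p in self.pools) / len(self.pools)  # multi-scale context
        s = self.spatial(s)                                  # spatial branch output
        c = self.channel(y)                                  # per-channel weights
        g = torch.sigmoid(self.gate(s))                      # fusion gate
        return x + g * (y * c) + (1.0 - g) * s               # gated fusion + residual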
Experiments on CIFAR-10 show that SGFA-ConvNeXt outperforms baseline models at minimal extra cost, and ablation studies confirm the effectiveness of each SGFA component. The work advances lightweight CNN design by balancing efficiency with improved feature extraction, making it suitable for resource-limited environments.
Conclusion
To address insufficient recognition accuracy and incomplete feature extraction in current image classification tasks, this study proposes an improved model, SGFA-ConvNeXt, based on the ConvNeXt-Tiny architecture and integrating both spatial and channel attention. By incorporating the Spatial Gated Fusion Attention (SGFA) module at the end of each stage, the model performs multiscale, multidimensional modeling of key regions and channel-level feature enhancement on input images, improving both its discriminative capability and its feature representation.
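As a usage illustration, the sketch below attaches the SGFA block (from the sketch above) to the end of each of the four stages of torchvision's ConvNeXt-Tiny. The use of torchvision, the stage indices, and the per-stage widths (96/192/384/768 for ConvNeXt-Tiny) are assumptions about the setup, not details taken from the paper:

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

model = convnext_tiny(weights=None, num_classes=10)  # CIFAR-10 has 10 classes

# In torchvision's layout, model.features holds [stem, stage1, down, stage2,
# down, stage3, down, stage4]; the four stages sit at indices 1, 3, 5, 7.
for idx, width in zip((1, 3, 5, 7), (96, 192, 384, 768)):
    model.features[idx] = nn.Sequential(model.features[idx], SGFA(width))

x = torch.randn(2, 3, 32, 32)  # CIFAR-10 sized input
print(model(x).shape)          # torch.Size([2, 10])
```

Because every operation in the block is stride-1 and channel-preserving, appending it to a stage leaves all downstream tensor shapes unchanged.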
Experimental results on the CIFAR-10 dataset demonstrate that SGFA-ConvNeXt outperforms the original ConvNeXt and other mainstream models while maintaining low computational overhead, achieving a gain of over 2% in classification accuracy.
The primary contributions of this research are twofold:
1) The design of the SGFA module, which fuses spatial and channel attention, effectively addresses the difficulty traditional networks face in simultaneously modeling local details and global semantics.
2) A significant improvement in classification performance is achieved without compromising efficiency, making the method well-suited for resource-constrained environments such as edge computing.
Future research could further optimize the SGFA module to improve its performance on more complex tasks, explore its application in other image classification domains such as medical image analysis, and extend it to related vision tasks such as object detection. Further reducing computational cost while maintaining performance is another promising direction for in-depth investigation.