Detecting small objects in drone-captured images is an especially challenging task due to factors such as scale variations, occlusions, and cluttered backgrounds. Traditional CNN-based methods like Faster R-CNN and YOLO perform very well on larger objects but often miss finer details needed for small object detection. Vision Transformers (ViTs) offer a promising alternative with their global self-attention capabilities, yet they typically incur high computational costs that hinder real-time applications.
In this paper, we introduce ViT-YOLOv8, a hybrid model that merges the efficiency of CNN-based detection with the global context understanding of Vision Transformers. Our approach enriches the classic Darknet architecture with multi-head self-attention (MHSA-Darknet) and integrates a modified C3-PANet with CARAFE upsampling to enhance multi-scale feature fusion. Additionally, our anchor-free detection head directly predicts object centers and dimensions, which leads to improved localization of small, irregularly shaped objects.
Through extensive experiments on the VisDrone-DET2019 dataset, our model shows an improvement of approximately 3.5 percentage points in mean average precision (mAP) over baseline YOLOv8, while still delivering real-time performance. Ablation studies and real-world observations further underline the importance of each component. We believe that ViT-YOLOv8 sets a new benchmark in UAV-based small object detection and can be foundational for applications in surveillance, disaster management, and beyond.
Introduction
The paper focuses on improving small object detection in UAV (drone) imagery using a hybrid deep learning model called ViT-YOLOv8, which combines CNN-based feature extraction with Vision Transformers. The motivation arises from the difficulty of detecting tiny, often occluded objects such as pedestrians, vehicles, or damage in aerial images—an issue critical in applications like disaster response, surveillance, wildlife monitoring, and infrastructure inspection.
To address these challenges, the model integrates:
MHSA-enhanced Darknet backbone for capturing both local and global image context,
C3-PANet with CARAFE upsampling for better multi-scale feature fusion and detail preservation,
Anchor-free detection head for improved and simplified object localization.
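To make the anchor-free idea concrete, the following minimal sketch decodes one grid-cell prediction into a pixel-space bounding box. The text only states that the head predicts object centers and dimensions; the exact parameterization used here (a center offset within the cell plus a width and height in pixels), the function name `decode_center_prediction`, and the example stride of 8 are illustrative assumptions, not the paper's implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def decode_center_prediction(col, row, dx, dy, w, h, stride):
    """Turn one anchor-free prediction into an (x1, y1, x2, y2) box.

    Assumed (hypothetical) parameterization:
      col, row : integer grid-cell indices on the feature map
      dx, dy   : predicted center offset inside the cell, in [0, 1)
      w, h     : predicted box width and height, in pixels
      stride   : input pixels covered by one grid cell
    """
    # The object center is the cell origin plus the offset, scaled to pixels.
    cx = (col + dx) * stride
    cy = (row + dy) * stride
    # No anchor box is involved: width and height are used directly.
    return Box(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = decode_center_prediction(col=10, row=4, dx=0.5, dy=0.25,
                               w=12.0, h=20.0, stride=8)
print(box)  # Box(x1=78.0, y1=24.0, x2=90.0, y2=44.0)
```

Because no predefined anchor shapes enter the computation, a 12 x 20 pixel object is described as easily as any other aspect ratio, which is the property that matters for small, irregularly shaped targets.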
The model is trained on the VisDrone-DET2019 dataset with data augmentation and is evaluated using precision, recall, mAP, IoU, and FPS.
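As a reminder of how the box-level metrics are defined, the sketch below computes IoU for two boxes and precision and recall from detection counts. These are the standard definitions; the helper names and the example numbers are illustrative and are not taken from the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A 10 x 10 pixel ground-truth box and a prediction shifted by only 2 pixels.
ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (2.0, 0.0, 12.0, 10.0)
print(round(iou(ground_truth, prediction), 3))  # 0.667

# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed objects.
print(precision_recall(tp=8, fp=2, fn=8))  # (0.8, 0.5)
```

The first example shows why small objects are hard to score well on: a 2-pixel localization error on a 10-pixel object already lowers IoU to about 0.67, whereas the same error on a large object would barely change it.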
Experimental results show that ViT-YOLOv8 outperforms baseline models like YOLOv5, YOLOv7, and YOLOv8, achieving higher accuracy (mAP50 = 36.9) while maintaining real-time performance (109 FPS). Ablation studies confirm that each component (attention, CARAFE, anchor-free design) contributes significantly to performance gains.
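For readers who think in latency rather than throughput, the reported frame rate can be converted to an average per-image processing time. The short sketch below does only that arithmetic; it assumes frames are processed one at a time, which the text does not state.

```python
def latency_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


print(round(latency_ms(109), 2))  # 9.17
```

At 109 FPS the model therefore spends roughly 9 ms per image on average, under the single-frame assumption above.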
The study concludes that the hybrid approach effectively improves small object detection in complex UAV environments, though challenges remain in computational efficiency, cross-dataset generalization, and real-world deployment. Future work includes lightweight transformers, multimodal fusion, domain adaptation, and ethical deployment considerations.
Conclusion
In this paper, we presented ViT-YOLOv8, a novel hybrid model that combines the efficiency of CNNs with the global contextual understanding of Vision Transformers to improve small object detection in UAV imagery. By integrating a multi-head self-attention-enhanced Darknet backbone, a refined C3-PANet with CARAFE upsampling, and an anchor-free detection head, our model achieves significant improvements in both accuracy and speed.
Our extensive experiments on the VisDrone-DET2019 dataset demonstrate that ViT-YOLOv8 outperforms state-of-the-art methods, increasing the mean average precision by approximately 3.5 points and achieving an inference speed of 109 FPS. Detailed ablation studies and additional real-world observations confirm the critical contributions of each component. Although challenges remain, particularly in further reducing computational overhead and ensuring broad generalizability, ViT-YOLOv8 represents a significant step forward in UAV-based object detection. We believe that our work sets a new benchmark for small object detection and provides a robust foundation for the next generation of real-time UAV surveillance systems.