With uses in robotics, industrial automation, autonomous vehicles, and surveillance, object detection is a fundamental computer vision problem. Within the context of the COCO dataset, this work compares the performance of several state-of-the-art object detection models, including Mask R-CNN (Detectron2), YOLOv8s, YOLOv8l, and YOLOv11s. Key metrics such as mean Average Precision (mAP), precision, recall, and inference speed are used to compare the models.
The results indicate that while Mask R-CNN is accurate, its computational cost makes it less suitable for real-time use. YOLO models, particularly YOLOv8s, strike a compromise between accuracy and speed and are therefore well suited to real-time detection. YOLOv8l is computationally more demanding but offers somewhat higher accuracy. Owing to its balance of speed and accuracy, YOLOv8s is the most suitable model for real-time application. This study can help researchers and developers select the most suitable object detection models for various applications.
Introduction
Object detection is a key area of computer vision enabling machines to identify and locate objects in images or videos, with applications in robotics, autonomous vehicles, medical imaging, and security. Recent advances in deep learning have produced models like Mask R-CNN (Detectron2), YOLOv8 (small and large), and YOLOv11s, which aim to balance accuracy and speed, especially for real-time use.
Earlier models such as Faster R-CNN were accurate but computationally heavy, limiting real-time deployment. YOLO models revolutionized the field by enabling one-stage detection, achieving faster inference without significant loss in accuracy. Detectron2 (Mask R-CNN) offers precise object segmentation but requires high computational power, making it less suited for real-time applications.
The study evaluates these models using the COCO dataset, focusing on metrics like precision, recall, mean Average Precision (mAP), and inference time. YOLOv8l showed high accuracy with good speed, YOLOv11s improved detection flexibility and speed, while YOLOv8s provided an efficient balance suitable for edge devices. Detectron2 delivered superior segmentation accuracy but was slower.
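The precision and recall metrics used in this comparison rest on IoU-based matching between predicted and ground-truth boxes. The sketch below is a simplified, illustrative version of that protocol (the official COCO evaluation adds per-class handling, area ranges, and multiple IoU thresholds); all box data in the usage example is made up.

```python
# Simplified sketch of IoU matching and precision/recall for detection.
# Boxes are [x1, y1, x2, y2]; detections are (confidence, box) pairs.

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(detections, ground_truths, iou_thresh=0.5):
    """Greedy matching: each detection, in descending confidence order,
    may claim at most one unmatched ground-truth box with IoU >= threshold.
    iou_thresh=0.5 corresponds to the AP50/mAP50 setting."""
    matched = set()
    tp = 0
    for score, box in sorted(detections, key=lambda d: -d[0]):
        best, best_iou = None, iou_thresh
        for i, gt in enumerate(ground_truths):
            if i in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative data: one good detection, one spurious one.
gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
dets = [(0.9, [1, 1, 10, 10]), (0.8, [50, 50, 60, 60])]
p, r = precision_recall(dets, gts)  # -> (0.5, 0.5)
```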
Ultimately, the choice depends on the specific application's need for speed, accuracy, and computational resources. YOLO models are preferable for real-time tasks, whereas Detectron2 suits scenarios demanding detailed segmentation.
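Inference time, the other axis of this trade-off, is typically measured as mean wall-clock latency per image after a warm-up phase. A minimal, framework-agnostic sketch (`model_fn` is a hypothetical stand-in for any model's forward call, e.g. a YOLO or Detectron2 predictor; real GPU benchmarks should also synchronize the device before reading the clock):

```python
import time

def mean_inference_ms(model_fn, inputs, warmup=3, runs=20):
    """Rough mean per-input latency in milliseconds.
    Warm-up runs let caches, allocators, and any JIT settle first."""
    for x in inputs[:warmup]:
        model_fn(x)
    start = time.perf_counter()
    for _ in range(runs):
        for x in inputs:
            model_fn(x)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(inputs)) * 1000.0
```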
Conclusion
Herein, we have experimented with and compared various object detection models, namely Mask R-CNN (Detectron2), YOLOv8s, YOLOv8l, and YOLOv11s, on accuracy measures (AP, mAP), precision, recall, and inference time on the COCO dataset. Our results indicate that although Mask R-CNN is well suited to instance segmentation and has very high accuracy (AP = 0.375, AP50 = 0.546), it consumes a lot of resources and is thus not appropriate for real-time use. Conversely, YOLO models such as YOLOv8s and YOLOv8l performed better, at an mAP50 of 0.760 and 0.770, respectively, with much lower inference times (4.95 ms for YOLOv8s); YOLOv8l outperformed YOLOv8s at the cost of increased computation.
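The AP and mAP figures quoted above are areas under the precision-recall curve, accumulated over detections ranked by confidence. A simplified sketch of that computation follows (the official COCO metric uses 101-point interpolated precision, which this rectangle-rule version omits; the `tp_flags` input in the example is made up):

```python
def average_precision(tp_flags, num_gt):
    """AP as area under the precision-recall curve.
    tp_flags: 1 (true positive) or 0 (false positive) per detection,
    sorted by descending confidence. num_gt: number of ground-truth boxes.
    COCO-style mAP50-95 averages this over IoU thresholds 0.50 to 0.95."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap

# Four detections against four ground-truth boxes: TP, TP, FP, TP.
ap = average_precision([1, 1, 0, 1], num_gt=4)  # -> 0.6875
```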
While the YOLOv11 models reached an mAP50-95 of 0.578, they were not significantly better than the YOLOv8 models.
The top real-time object detection model is YOLOv8s when inference speed, accuracy, and computational efficiency trade-offs are considered. It can be used for real-time tracking, surveillance, and autonomous applications due to its accuracy-speed ratio. Future research can explore model optimization and hybrid methods to improve detection efficiency.