Object detection is a fundamental computer vision problem with applications in robotics, industrial automation, autonomous vehicles, and surveillance. This work compares the performance of several state-of-the-art object detection models, Mask R-CNN (Detectron2), YOLOv8s, YOLOv8l, and YOLOv11s, on the COCO dataset. The models are compared on key metrics: mean Average Precision (mAP), precision, recall, and inference speed.
The results indicate that while Mask R-CNN is accurate, its computational cost makes it less suitable for real-time use. The YOLO models, particularly YOLOv8s, strike a balance between accuracy and speed that makes them well suited to real-time detection; YOLOv8l is computationally more demanding but somewhat more accurate. Given its combination of speed and accuracy, YOLOv8s emerges as the most suitable model for real-time applications. This study offers researchers and developers guidance in selecting the most appropriate object detection model for a given application.
Introduction
Objective
This study evaluates and compares several deep learning-based object detection models (Detectron2's Mask R-CNN, YOLOv8s, YOLOv8l, and YOLOv11s) to determine which is most suitable for real-time object detection, based on precision, recall, mean Average Precision (mAP), and inference time measured on the COCO dataset.
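The mAP, AP50, and AR figures reported below follow the standard COCO evaluation protocol. A minimal sketch of that protocol using pycocotools (the file paths are placeholders, not the files used in this study):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model detections in COCO JSON format
# (placeholder paths, not this study's actual files).
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP (mAP50-95), AP50, AR, and per-size breakdowns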
Key Points
1. Background
Object detection is crucial in robotics, autonomous vehicles, security, and more.
Traditional models like Faster R-CNN offer high accuracy but are computationally heavy.
YOLO (You Only Look Once) models revolutionized object detection with faster inference and good accuracy.
2. Dataset
The COCO dataset (200,000+ images, 80 classes) was used to benchmark models.
It includes varied object sizes and complex scenes, ideal for evaluating generalization.
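A minimal sketch of loading and querying the dataset with pycocotools, assuming the standard val2017 annotation file:

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # placeholder path

# The 80 detection categories used for benchmarking.
categories = coco.loadCats(coco.getCatIds())
print(len(categories), "classes")

# Example query: all images containing at least one "person" instance.
person_id = coco.getCatIds(catNms=["person"])[0]
image_ids = coco.getImgIds(catIds=[person_id])
print(len(image_ids), "images with people")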
3. Models Overview
A. Detectron2 (Mask R-CNN)
Two-stage detector: a Region Proposal Network first generates candidate regions, which are then classified and refined.
High accuracy, especially for instance segmentation, but computationally expensive.
Performance: AP50 = 0.546, AR = 0.445; struggles with small objects.
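A minimal inference sketch using Detectron2's model zoo; the config below is a standard COCO-pretrained Mask R-CNN and the image path is a placeholder, not necessarily the exact setup benchmarked here:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence threshold

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")   # BGR image, as Detectron2 expects
outputs = predictor(image)          # stage 1: RPN proposals; stage 2: ROI heads
instances = outputs["instances"]
print(instances.pred_classes, instances.pred_boxes)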
B. YOLOv8s
Lightweight, optimized for edge devices and real-time use.
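A minimal sketch using the ultralytics package; the image path is a placeholder, and swapping "yolov8s.pt" for "yolov8l.pt" or "yolo11s.pt" gives the other variants evaluated below:

from ultralytics import YOLO

model = YOLO("yolov8s.pt")      # small COCO-pretrained variant

# Single-stage inference: one forward pass per image.
results = model("example.jpg")
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)

# COCO validation computes the mAP50 / mAP50-95 metrics used in this study
# (exact numbers depend on version, image size, and hardware).
metrics = model.val(data="coco.yaml")
print(metrics.box.map50, metrics.box.map)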
C. YOLOv8l
Larger, more accurate version of YOLOv8 with better object detection capabilities.
Higher precision (0.810) and recall, but requires more computation.
Best performance in terms of detection accuracy.
D. YOLOv11s
Advanced version focused on real-time detection with multi-scale feature fusion.
Similar mAP to YOLOv8s (mAP50-95 = 0.578) but better under certain conditions.
4. Model Comparison & Findings
Model        mAP50    mAP50-95    Inference Time    Key Advantage
Detectron2   0.546    0.375       High              High accuracy and segmentation
YOLOv8s      0.760    0.587       4.95 ms           Speed + efficiency
YOLOv8l      0.775    0.610       Moderate          High accuracy
YOLOv11s     –        0.578       Low               Fast, robust detection
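The measurement protocol for inference time is not spelled out here; a plausible harness, assuming warm-up runs and averaged wall-clock latency on a fixed test image, might look like this:

import time
from ultralytics import YOLO

model = YOLO("yolov8s.pt")

for _ in range(10):                       # warm-up (CUDA init, kernel caching)
    model("example.jpg", verbose=False)

n = 100
start = time.perf_counter()
for _ in range(n):
    model("example.jpg", verbose=False)
mean_ms = (time.perf_counter() - start) / n * 1000
print(f"mean latency: {mean_ms:.2f} ms/image")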
Conclusion
In this work, we experimented with and compared several object detection models, Mask R-CNN (Detectron2), YOLOv8s, YOLOv8l, and YOLOv11s, on the COCO dataset, measuring accuracy (AP, mAP), precision, recall, and inference time. Our results indicate that although Mask R-CNN achieves high accuracy and is well suited to instance segmentation (AP = 0.375, AP50 = 0.546), it is resource-intensive and therefore poorly suited to real-time use. The YOLO models performed better: YOLOv8s and YOLOv8l reached mAP50 values of 0.760 and 0.775, respectively, at much lower inference times (4.95 ms for YOLOv8s), with YOLOv8l more accurate than YOLOv8s at the cost of increased computation.
While YOLOv11s reaches an mAP50-95 of 0.578, it did not significantly outperform the YOLOv8 models.
When the trade-offs among inference speed, accuracy, and computational efficiency are weighed, YOLOv8s is the best real-time object detection model of those tested. Its accuracy-to-speed ratio makes it suitable for real-time tracking, surveillance, and autonomous systems. Future research can explore model optimization and hybrid methods to further improve detection efficiency.