IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Sachin Chavan, Dr. Ramesh Manza
DOI Link: https://doi.org/10.22214/ijraset.2026.76991
Real-time shape detection and object tracking are core components of modern computer vision, enabling a range of high-level applications including autonomous navigation, robotics, smart surveillance, and human-computer interfaces. Given the growing dependence on autonomous systems, there is a pressing need for algorithms that deliver high accuracy and low computational latency at the same time. This review provides an in-depth discussion of the evolution of shape detection and object tracking methodologies up to the current state of the art, with a focus on real-time operation. It traces the transition from traditional image-processing methods, such as edge detection and Kalman filtering, to deep learning models, from Convolutional Neural Networks (CNNs) to more recent Vision Transformers (ViTs). The review also categorizes and analyzes popular open-source datasets, such as COCO, MOTChallenge, and KITTI, and outlines the implications of proprietary data sources. Integrated systems (Tracking-by-Detection and Joint Detection and Embedding (JDE)) are evaluated empirically, supported by quantitative performance measures such as Mean Average Precision (mAP) and Multi-Object Tracking Accuracy (MOTA). Finally, the paper identifies the enduring issues of occlusion, scale variation, and edge deployment, and suggests future research avenues that could help bridge the gap between theory and practical applications.
Shape detection and object tracking represent the spatial and temporal dimensions of computer vision, respectively.
Shape detection identifies geometric structures and precise object boundaries in images, evolving into instance segmentation in deep learning frameworks.
Object detection determines what the object is and where it is located (bounding boxes + classification).
Object tracking determines where the object is going by maintaining its identity across video frames despite motion or occlusion.
These technologies are critical in Industry 4.0, autonomous driving, medical robotics, and surveillance, where real-time performance (typically >30 FPS) is essential.
Early approaches relied on handcrafted features and geometric modeling:
Edge detection (Canny)
Active Contours (Snakes)
Viola–Jones (Haar cascades)
HOG + SVM for pedestrian detection
Hough Transform for geometric primitives
These methods were computationally efficient but lacked semantic understanding and robustness to illumination changes, noise, and occlusion.
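To ground these classical techniques, here is a minimal OpenCV sketch chaining Canny edge detection with a Hough transform for circular primitives; the input file name and all threshold values are illustrative assumptions, not settings drawn from the surveyed papers.

```python
import cv2
import numpy as np

# Minimal classical pipeline: Gaussian smoothing -> Canny edges -> Hough circles.
# "shapes.png" and every threshold below are illustrative assumptions.
img = cv2.imread("shapes.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 1.5)   # suppress noise before edge detection

edges = cv2.Canny(blurred, 50, 150)            # low/high hysteresis thresholds

# Hough transform for circles; param2 is the accumulator threshold
# (lower values detect more, possibly spurious, circles).
circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                           param1=150, param2=40, minRadius=10, maxRadius=100)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (x, y), r, 255, 2)     # draw each detected circle outline
```

Pipelines of this kind run at high frame rates even on modest CPUs, which is precisely why they persisted despite their lack of semantic understanding.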
Deep learning fundamentally changed object detection and tracking:
Two-Stage Detectors
R-CNN → Fast R-CNN → Faster R-CNN
High accuracy but slower inference speed.
One-Stage Detectors
YOLO (You Only Look Once)
SSD (Single Shot MultiBox Detector)
Faster, real-time capable, slightly lower accuracy (early versions).
Instance Segmentation
Mask R-CNN (adds mask prediction branch)
YOLACT (real-time segmentation above 30 FPS)
Transformer-Based Models
DETR (Detection Transformer)
TrackFormer
End-to-end attention-based detection and tracking without hand-designed components such as non-maximum suppression (NMS).
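As a concrete illustration of one-stage, real-time detection, the sketch below runs a pretrained YOLO model through the ultralytics Python API. The package, weight file, and image path are assumptions for illustration; any detector exposing a similar interface would serve equally well.

```python
from ultralytics import YOLO  # assumes the ultralytics package is installed

# Load a small pretrained YOLO model; "yolov8n.pt" is an illustrative choice.
model = YOLO("yolov8n.pt")

# Run inference on a single frame; results hold boxes, classes, and confidences.
results = model("frame.jpg")
for box in results[0].boxes:
    cls_id = int(box.cls)                   # predicted class index
    conf = float(box.conf)                  # detection confidence
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding-box corners in pixels
    print(f"{model.names[cls_id]}: {conf:.2f} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```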
The dominant paradigm is Tracking-by-Detection (TBD):
SORT – Uses Kalman Filter + Hungarian algorithm (high speed).
DeepSORT – Adds appearance-based Re-ID to reduce identity switches.
JDE (Joint Detection & Embedding) – Combines detection and feature embedding.
ByteTrack – Improves data association using low-confidence detections.
Transformer-based tracking (TrackFormer) integrates detection and tracking in a single model.
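The data-association step shared by SORT-style trackers can be sketched compactly: predicted track boxes are matched to new detections by minimizing (1 − IoU) with the Hungarian algorithm. The sketch below covers the matching step only, assuming SciPy is available; Kalman prediction, Re-ID features, and track lifecycle management are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_min=0.3):
    """Hungarian matching on an IoU cost matrix (SORT-style data association)."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)   # minimize total (1 - IoU)
    # Keep only matches with sufficient overlap; unmatched tracks are marked
    # lost and unmatched detections spawn new tracks in a full tracker.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_min]
```

DeepSORT and JDE extend exactly this step by mixing an appearance-embedding distance into the cost matrix, which is what reduces identity switches under occlusion.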
Key challenges in tracking include:
Identity switches during occlusion
Computational cost of frame-wise detection
Real-time processing constraints
Widely used datasets provide standardized evaluation:
COCO – Object detection & instance segmentation (80 classes)
PASCAL VOC – Earlier benchmark (20 classes)
Open Images Dataset – 9M images, large-scale annotations
KITTI – Autonomous driving (3D detection & tracking)
MOTChallenge – Multi-object tracking benchmark
DAVIS – Video object segmentation
ShapeNet – 3D CAD models
Cityscapes – Urban scene segmentation
These datasets vary in domain specificity and annotation type (bounding boxes, masks, 3D labels).
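For instance, COCO-format annotations are typically accessed through the pycocotools API; the sketch below assumes that package is installed and uses an illustrative annotation path.

```python
from pycocotools.coco import COCO  # assumes pycocotools is installed

# Load COCO-format instance annotations; the path is an illustrative assumption.
coco = COCO("annotations/instances_val2017.json")

cat_ids = coco.getCatIds(catNms=["person"])    # look up category IDs by name
img_ids = coco.getImgIds(catIds=cat_ids)       # images containing that class
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
for ann in coco.loadAnns(ann_ids):
    print(ann["bbox"], ann["category_id"])     # boxes in [x, y, width, height]
```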
Industries (e.g., autonomous driving and healthcare) rely on private datasets:
Tesla, Waymo (driving edge cases)
Hospital MRI/CT databases (tumor segmentation)
While powerful, they limit reproducibility in academic research.
The field has shifted:
From high-accuracy but slow two-stage detectors (e.g., Faster R-CNN)
To real-time one-stage models (e.g., YOLO, SSD)
To efficient tracking-by-detection frameworks (DeepSORT, ByteTrack)
Toward transformer-based end-to-end systems (DETR, TrackFormer)
The speed–accuracy trade-off gap is steadily narrowing.
Performance is measured using standard quantitative metrics:
Intersection over Union (IoU) – Overlap between prediction and ground truth.
Mean Average Precision (mAP) – Detection accuracy across classes and IoU thresholds.
MOTA (Multiple Object Tracking Accuracy) – Combines false positives, false negatives, and identity switches.
IDF1 Score – Measures identity preservation over time.
Frames Per Second (FPS) – Determines real-time capability (≥30 FPS typically required).
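As a worked example of the tracking metric, MOTA aggregates the three error types over all frames as MOTA = 1 − (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects. The sketch below computes it from illustrative counts.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth_total):
    """MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames.
    A perfect tracker scores 1.0; the score can go negative when the
    error count exceeds the number of ground-truth objects."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth_total

# Illustrative counts only:
print(mota(false_negatives=120, false_positives=45, id_switches=8,
           ground_truth_total=1500))   # -> approximately 0.885
```

Because MOTA weights all three error types equally, it is usually reported alongside IDF1, which isolates how well identities are preserved over time.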
Despite progress, major limitations remain:
Sensitivity to lighting changes, motion blur, and background clutter
Identity switches during occlusion
High computational cost of deep models
Dataset bias and poor generalization to new domains
Balancing real-time speed with high accuracy
This review surveyed the methodology of real-time shape detection and object tracking, outlining the path from heuristic edge detection to complex deep-learning pipelines. On static benchmarks such as COCO, detection accuracy now rivals or exceeds human performance; maintaining that accuracy under the demands of real-time video processing, however, remains challenging. The current dominance of the YOLO family in detection, together with hybrid trackers such as DeepSORT, reflects the community's pursuit of an optimal balance between speed and accuracy. The integration of shape detection into tracking, as in the MOTS framework, yields a richer understanding of the scene, but at high computational cost. Finally, Vision Transformers promise a paradigm shift toward more global, context-aware reasoning, yet still require further optimization to match the performance of convolutional neural networks on edge devices.