Authors: Anjali Suthar, Prof. Jayandrath Mangrolia, Prof. Ravi Patel
Human activity recognition (HAR) is the analysis of human motion using computer and machine vision technologies. One application of HAR in security systems is anomaly detection. As the demand for security has grown, surveillance cameras have been widely deployed as the foundation for video analysis. A major challenge in surveillance video analysis is that identifying aberrant behaviour requires considerable human effort. It is therefore important to set up video monitoring that detects unusual activity automatically. Our intelligent video surveillance system uses deep learning to detect abnormalities in video. Activities can also be detected in real time, and the relevant video frames are saved in the system as images for the user to study. The proposed Abnormal Activity Recognition system was designed to find and detect anomalies in the financial industry, especially in an ATM context, using a live stream. The first part of the research focuses on the use of image-based deep learning algorithms to recognise different objects and detect anomalous behaviour using ATM monitoring systems.
Real-time prediction of the presence of one or more objects, along with their classes and bounding boxes, is a computer vision task that has taken the industry by storm. Object detection uses a neural network to classify and localize objects in an image. Building on this capability, a tremendous amount of work is being done in many fields, from facial recognition to autonomous cars, security applications and robotics. Modern detectors continue to be developed to identify objects at higher frame rates.
This paper proposes a deep learning-based system for detecting suspicious events in a bank-ATM context in real time. Bounding boxes, which function as classes in this case, are used to detect labelled objects. These are then used to categorise labels in video and predict whether events are normal or abnormal. The result is calculated using the motion representation, and depth data is derived from the classes' bounding boxes. Multi-stream CNNs are then used to distinguish constituents and actions. A central concern is choosing an appropriate algorithm for the task.
The implementation produced strong results in both accuracy and speed. In contrast to traditional object detection on static images, video object detection detects objects in video data. Autonomous driving and video surveillance are two applications that have played a significant role in the advancement of video object detection.
Detecting objects in video requires running object detection on each image frame. Object detection techniques may be divided into two categories: (1) one-stage detectors and (2) two-stage detectors. One-stage detectors are generally more computationally efficient than two-stage detectors, whereas two-stage detectors have been found to provide higher accuracy.
However, running object detection on each image frame independently ignores two properties of video data: (1) Because image frames are correlated both spatially and temporally, feature extraction is redundant between adjacent frames, so extracting features in every frame is computationally inefficient. (2) In a long video stream, some frames may have poor quality due to motion blur, video defocus, occlusion, and pose changes, and detecting objects in poor-quality frames leads to low accuracy. Approaches to video object detection attempt to solve these problems. Some methods, such as feature fusion across several layers, exploit spatial-temporal information to increase accuracy. Other strategies concentrate on improving detection efficiency and eliminating information redundancy. In what follows, YOLOv5, YOLOv6 and YOLOv7 are discussed in turn.
However, there is always a trade-off between speed and accuracy among these methods. Several YOLO versions have been developed, and each version strikes a different balance between speed and accuracy. The methodological improvements and structure of YOLO are also discussed. Afterwards, we compare the performance of all three models to determine which is the most accurate.
II. RELATED WORK
The YOLO (You Only Look Once) algorithm uses convolutional neural networks (CNNs) to identify objects quickly. As the name implies, the approach needs only one forward pass through a neural network to detect objects.
A. Object Detection with Deep Learning
Deep learning has been widely employed for object detection, the process of identifying and locating objects in digital photos or videos. Deep learning networks for object detection are trained on large datasets of labelled images, from which they learn to recognize objects by extracting features such as edges, corners, textures, and colors. These features are then used to predict the presence and location of objects in previously unseen images.
Object detection works particularly well for objects that occupy between 2% and 60% of an image's area, and it is very efficient at detecting objects with distinct borders. It can also detect groups of objects as a single item and perform object localization at high speed (>15 fps).
Furthermore, object detection is becoming more prevalent in a range of industries, with applications ranging from company productivity to personal security. Convolutional Neural Networks (CNNs), for example, have shown remarkable success in achieving high accuracy and real-time performance in a wide range of applications, including autonomous cars, surveillance systems, and face recognition systems. Because they can automatically learn and adapt to different item classes, orientations, sizes, and lighting conditions, these algorithms are particularly successful at recognising objects in complicated and dynamic situations.
B. Introduction to the YOLO Family
When YOLO was introduced, RCNN models were the most popular models for object detection. Although the RCNN family of models was accurate, it was very slow because of its multi-step process: proposing candidate regions for the bounding boxes, classifying those regions, and finally post-processing to refine the result.
YOLO was established with the purpose of eliminating multistage detection and doing object detection in a single stage, hence improving inference time.
YOLOv1 achieved 63.4 mAP with an inference speed of 45 frames per second (22 ms per image). At that time, this was a huge improvement in speed over the RCNN family, whose inference times ranged from 143 ms to 20 seconds per image.
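The relationship between the frame rates and per-image latencies quoted above is simple arithmetic. The following is a minimal sketch; the function names are illustrative and do not come from any YOLO codebase.

```python
def ms_per_image(fps: float) -> float:
    """Convert a frame rate (frames per second) to latency in milliseconds."""
    return 1000.0 / fps


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to a frame rate."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # YOLOv1 figure quoted in the text: 45 frames per second.
    print(f"45 fps    = {ms_per_image(45):.1f} ms per image")
    # RCNN-family range quoted in the text: 143 ms to 20 s per image.
    print(f"143 ms    = {frames_per_second(143):.2f} fps")
    print(f"20,000 ms = {frames_per_second(20_000):.2f} fps")
```

Seen this way, the RCNN-family latencies in the text correspond to roughly 7 frames per second at best, well below the 45 frames per second reported for YOLOv1.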
b. Technical Improvements
The YOLO model is built on a unified detection approach, which combines the separate components of object detection into a single feed-forward neural network.
The model divides an input image into a grid of cells and estimates, for each cell, the likelihood that an object is present. This is repeated for every cell into which the image is split. The algorithm then groups neighbouring high-probability cells into a single object. Redundant, lower-scoring predictions that overlap a stronger one are eliminated using a method known as Non-Max Suppression (NMS).
The model is trained in a similar fashion: the centre of each detected object is compared with the ground truth to check whether the prediction is correct, and the weights are adjusted accordingly.
2. YOLOv2 – Better, Faster, Stronger
YOLOv2 was capable of detecting over 9000 different object categories. This version improved on its predecessor, YOLOv1.
On the VOC 2012 dataset, YOLOv2 achieved a performance of 78.6 mAP. The table below shows that it outperformed other object detection algorithms on the VOC 2012 dataset.
b. Technical Improvements
YOLOv2 introduced the concept of anchor boxes. Anchor boxes are predefined regions of an image that represent the idealized position and shape of the objects to be detected. We calculate the intersection over union (IoU) of the predicted bounding box and the predefined anchor box. The IoU value is compared against a threshold to decide whether the probability of the detected object is sufficient to make a prediction.
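The IoU computation can be written in a few lines. This is a minimal sketch assuming boxes in corner format `(x1, y1, x2, y2)`; the helper `passes_threshold` and its default value of 0.5 are illustrative assumptions, not values taken from the YOLOv2 paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def passes_threshold(predicted_box, anchor_box, threshold=0.5):
    """Hypothetical helper: accept a prediction if it overlaps the anchor enough."""
    return iou(predicted_box, anchor_box) >= threshold


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    anchor = (5, 5, 15, 15)
    print(f"IoU = {iou(predicted, anchor):.3f}")
    print("accepted" if passes_threshold(predicted, anchor) else "rejected")
```

An IoU of 1 means the two boxes coincide and 0 means they do not overlap at all, which is why a single threshold on this value is enough to accept or reject a match.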
c. Additional Improvements
3. YOLOv3: An Incremental Improvement
This model was slightly larger than its predecessors and more accurate, while remaining fast.
On the COCO dataset, YOLOv3-320 has an mAP of 28.2 with an inference time of 22 milliseconds. This is three times faster than the SSD object detection technique, with similar accuracy.
b. Technical Improvements
YOLOv3 consisted of 75 convolutional layers, with no fully connected or pooling layers, which greatly reduced the model size and weight. It provided the best of both worlds: residual blocks (from the ResNet model) for learning multiple features, combined with a feature pyramid network (FPN), while keeping inference times minimal.
A feature pyramid network is a feature extractor that extracts features of different types, forms and sizes from a single image. It concatenates all the features so that the model can learn both local and general features.
By using logistic classifiers and activations for its class predictions, YOLOv3 surpasses RetinaNet-50 and RetinaNet-101 in terms of accuracy. As its backbone, the YOLOv3 model uses the Darknet53 architecture.
4. YOLOv4 – Optimal Speed and Accuracy of Object Detection
The YOLOv4 model outperforms other detection methods such as EfficientDet and ResNeXt50. It is built on the CSPDarknet53 backbone, a variant of the Darknet53 backbone used in YOLOv3.
b. Technical Advancements
The bag of freebies (techniques that improve model performance without raising inference cost) and the bag of specials (techniques that improve accuracy while increasing computation cost) were added in YOLOv4.
On the COCO dataset, it has a frame rate of 62 frames per second and a mAP of 43.5 percent.
c. Bag of Freebies (BOF)
d. Bag of Specials (BOS)
5. YOLOv5: Latest YOLO?
YOLOv5 was released in 2020 by Ultralytics, only a few days after YOLOv4, and is presented as the next member of the YOLO family. No paper has been published, and there is some dispute in the community over whether the use of the YOLO name is justified, given that it is regarded as only a PyTorch implementation of YOLOv3.
Because there is no official paper yet, the reported performance cannot be verified. It is reported to achieve the same, if not higher, accuracy (mAP of 55.6) than the other YOLO models while using fewer computing resources.
b. Technical Improvements
III. METHOD USED
In this section, we study the architecture of YOLOv5 and examine why it is well suited to object detection. The inference time and accuracy of the model were carefully observed and then compared with those of other models.
A. Improvements in YOLOv5
Any computer vision enthusiast has surely heard of YOLO models for object detection. Since YOLOv1 was introduced in 2015, the family has gained considerable popularity within the computer vision community. Subsequent versions, YOLOv2, YOLOv3, YOLOv4, and YOLOv5, have been released, albeit by different people. In this paper, we give a brief background on the object detection models of the YOLO family from YOLOv1 to YOLOv5.
B. Basic Working of YOLO Object Detector Models
The architecture of all YOLO models follows a similar pattern of components:
a. Backbone: A convolutional neural network that extracts visual features of various shapes and sizes. Classification models such as ResNet, VGG, and EfficientNet are employed as feature extractors.
b. Neck: A group of layers that combine and mix features before passing them on to the prediction layer. Examples include the feature pyramid network (FPN), the path aggregation network (PAN), and the Bi-FPN.
c. Head: Takes the features from the neck and produces the bounding box predictions. To complete the detection process, it performs classification on the features and regression on the bounding box coordinates. It outputs four values per box, usually the x and y coordinates along with the width and height.
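The four values produced by the head are often converted to corner coordinates before drawing or computing overlaps. The sketch below assumes that x and y denote the box centre, which is the usual YOLO convention but is not stated in the text; the function names are illustrative.

```python
def xywh_to_xyxy(cx, cy, w, h):
    """Convert (centre x, centre y, width, height) to (x1, y1, x2, y2)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_xywh(x1, y1, x2, y2):
    """Convert corner coordinates back to (centre x, centre y, width, height)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


if __name__ == "__main__":
    box = (50, 40, 20, 10)  # assumed head output for one object
    corners = xywh_to_xyxy(*box)
    print(corners)
    print(xyxy_to_xywh(*corners))
```

Keeping both directions available is convenient because label files typically store the centre format, while overlap computations are simpler in corner format.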
C. Data Augmentation in YOLOv5
YOLOv5 passes training data through a data loader, which augments the data online with each training batch. The data loader performs three types of augmentation, of which the most distinctive is mosaic data augmentation: it combines four images into four tiles of random ratio.
Mosaic augmentation is particularly effective for the widely used COCO object detection benchmark, helping the model to overcome the well-known "small object problem," in which small objects are not detected as reliably as larger ones.
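The idea of mosaic augmentation can be illustrated with a simplified sketch. It is not the YOLOv5 data loader: images are represented here as plain nested lists of pixel values, each image is resized to fill its tile with nearest-neighbour sampling, and the remapping of bounding box labels is omitted. All names and the 25% to 75% range for the split point are assumptions made for the example.

```python
import random


def resize_nearest(img, h, w):
    """Nearest-neighbour resize of a 2D list of pixels to h rows by w columns."""
    src_h, src_w = len(img), len(img[0])
    return [[img[r * src_h // h][c * src_w // w] for c in range(w)] for r in range(h)]


def mosaic(images, out_size=8, rng=None):
    """Combine four images into one square mosaic split at a random point."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    if rng is None:
        rng = random.Random()
    # Random split point, kept away from the borders so no tile is empty.
    cx = rng.randint(out_size // 4, 3 * out_size // 4)
    cy = rng.randint(out_size // 4, 3 * out_size // 4)
    canvas = [[0] * out_size for _ in range(out_size)]
    # Tiles as (x1, y1, x2, y2): top-left, top-right, bottom-left, bottom-right.
    tiles = [
        (0, 0, cx, cy),
        (cx, 0, out_size, cy),
        (0, cy, cx, out_size),
        (cx, cy, out_size, out_size),
    ]
    for img, (x1, y1, x2, y2) in zip(images, tiles):
        patch = resize_nearest(img, y2 - y1, x2 - x1)
        for r, row in enumerate(patch):
            canvas[y1 + r][x1:x2] = row
    return canvas, (cx, cy)


if __name__ == "__main__":
    # Four tiny single-value "images" so the tiles are easy to see.
    imgs = [[[v] * 3 for _ in range(2)] for v in (1, 2, 3, 4)]
    result, split = mosaic(imgs, out_size=8, rng=random.Random(0))
    for row in result:
        print(row)
    print("split point:", split)
```

Because the split point changes on every call, each source image appears at a different scale from batch to batch, which is one way the model is exposed to smaller versions of the same objects.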
It is worth experimenting with your own set of augmentations to maximise performance on your particular task.
The figure shows augmented training images in YOLOv5.
D. Bounding Box Anchors That Learn On Their Own
The YOLOv5 network predicts bounding boxes as deviations from a set of anchor box dimensions in order to produce box predictions.
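Predicting a box as a deviation from an anchor can be sketched with the parameterisation published for YOLOv2 and YOLOv3, in which the centre offset is squashed by a sigmoid and the anchor size is scaled exponentially. YOLOv5 uses a variant of this scheme, so the code below should be read as an illustration of the principle under that assumption rather than as the YOLOv5 formula; the argument names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, anchor_wh, cell_xy, stride):
    """Turn raw network outputs into a box (centre x, centre y, width, height).

    raw       -- (tx, ty, tw, th) predicted deviations
    anchor_wh -- (width, height) of the anchor box in pixels
    cell_xy   -- (column, row) index of the grid cell making the prediction
    stride    -- size of one grid cell in pixels
    """
    tx, ty, tw, th = raw
    anchor_w, anchor_h = anchor_wh
    col, row = cell_xy
    # The centre is an offset inside the predicting cell.
    bx = (sigmoid(tx) + col) * stride
    by = (sigmoid(ty) + row) * stride
    # Width and height rescale the anchor dimensions.
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


if __name__ == "__main__":
    # Zero deviations give a box centred in the cell with the anchor's size.
    print(decode_box((0, 0, 0, 0), anchor_wh=(30, 60), cell_xy=(3, 4), stride=32))
```

With all deviations at zero the decoded box is exactly the anchor placed at the centre of its cell, which shows why anchors that already resemble the objects in the dataset leave the network with less to learn.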
The mismatch between default anchor boxes and the data is most severe when trying to detect objects with extreme shapes, such as giraffes, which are very tall and narrow, or manta rays, which are very wide and flat. When custom data is supplied to YOLOv5, all anchor boxes are learned automatically from that data.
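One common way to learn anchors from data is to cluster the widths and heights of the labelled boxes, using IoU as the similarity measure as proposed for YOLOv2. The text does not describe the exact procedure used by YOLOv5, so the following is a minimal sketch of plain k-means under that assumption; the function names and sample box sizes are hypothetical.

```python
import random


def wh_iou(a, b):
    """IoU of two boxes given only as (width, height), aligned at one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def learn_anchors(box_sizes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs into k anchors with IoU-based k-means."""
    rng = random.Random(seed)
    anchors = rng.sample(box_sizes, k)  # start from k boxes in the data
    for _ in range(iterations):
        # Assign every box to the anchor it overlaps most.
        clusters = [[] for _ in range(k)]
        for wh in box_sizes:
            best = max(range(k), key=lambda i: wh_iou(wh, anchors[i]))
            clusters[best].append(wh)
        # Move each anchor to the mean size of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors)


if __name__ == "__main__":
    tall = [(10, 100), (12, 110), (11, 90)]   # giraffe-like boxes
    wide = [(100, 10), (110, 12), (90, 11)]   # manta-ray-like boxes
    print(learn_anchors(tall + wide, k=2))
```

On this toy data the two learned anchors settle on one tall, narrow shape and one wide, flat shape, mirroring the giraffe and manta ray cases described above.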
IV. RESULTS AND COMPARISON
The YOLO algorithms were applied to a sample image. Please see the Supplementary Material at the end of Section 4 for additional photographs and a video. Figure 5 depicts the performance of the YOLO algorithms in both PC and CC. Table A1 gives detailed findings, showing the average precision of the three YOLO algorithms for all labels. Table 3 reports the precision and recall of the algorithms: YOLOv3 has high precision but poor recall, indicating that the model needs improvement. For an algorithm to be considered efficient in our work, it must strike a balance between precision and recall, which is captured by its F1 score. Precision and recall are balanced in YOLOv4 and YOLOv5l, so both achieve higher F1 scores than YOLOv3, as well as higher accuracy.
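The way the F1 score penalises an imbalance between precision and recall can be checked with a short calculation. The counts below are hypothetical values chosen only to illustrate the effect; they are not the results reported in the tables.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from detection counts."""
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical detector with high precision but poor recall.
    p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=90)
    print(f"imbalanced: precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
    # Hypothetical detector with balanced precision and recall.
    p, r, f1 = precision_recall_f1(true_pos=80, false_pos=20, false_neg=20)
    print(f"balanced:   precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

Although the first detector has the higher precision, its F1 score is lower than that of the balanced detector, because the harmonic mean is pulled towards the smaller of the two values.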
V. FUTURE WORK
In this study, we conducted a broad literature review of object detection algorithms, their many versions, and their differing requirements. Based on the survey, we identified the following concerns with real-time object detection and tracking: 1. The majority of existing algorithms are image- or video-based and lose some information while extracting features from the images or videos, which makes detection and tracking difficult. 2. Some detection and tracking algorithms can identify and track several objects while also handling occlusion, but they have greater computational and memory requirements.
Based on the comparison of the YOLO models, YOLOv5 is the most suitable version for the proposed deep learning-based system for detecting suspicious events in a bank-ATM context in real time. Bounding boxes, which function as classes in this case, are used to detect labelled objects. These are then used to categorise labels in video and predict whether events are normal or abnormal. The result is calculated using the motion representation, and depth data is derived from the classes' bounding boxes. Multi-stream CNNs are then used to distinguish constituents and actions. Choosing an appropriate algorithm for the task remains a central concern.
This comparison arose while working on an object identification module that would distinguish favourable from poor landing places in real time. Based on prior related work, we were unable to determine which object identification technique performs best in this application while meeting the desired safety requirements. We therefore chose YOLOv3, YOLOv4, and YOLOv5 for their high detection speed and accuracy in real-time applications, and compared their accuracy and speed to see which method works best for emergency landing place identification. Based on the findings of our research, as shown in Figs. 7, 8 and 9, we selected the algorithm with the best accuracy, YOLOv5. In future work, we aim to establish the viability of combining YOLOv5 with the ATM-I dataset for fast and accurate recognition of multiple objects and actions. In our comparison, YOLOv5 models were the best approach for object detection and tracking.
Copyright © 2023 Anjali Suthar, Prof. Jayandrath Mangrolia, Prof. Ravi Patel. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.