Rapid urbanisation and the growth of smart city infrastructure have put heavy pressure on traditional surveillance systems, which were never designed to handle current volumes and types of threats. Conventional CCTV still relies on human operators to continuously watch multiple screens, flag incidents, and raise alarms, which does not scale to large camera networks. Human attention typically drops after about twenty minutes of continuous monitoring, and feeds from hundreds of cameras cannot realistically be reviewed in real time. Recorded footage also lacks structured metadata, so it cannot be searched or analysed automatically.
To address these limitations, we built the Real-Time Intelligent Video Analytics System (RIVAS), which integrates three deep-learning models into a single, end-to-end pipeline. YOLOv8 is used to locate people in each video frame, DeepFace with an ArcFace backbone identifies who they are, and MediaPipe Holistic estimates body posture to infer activities. As each frame arrives, RIVAS detects all visible persons, checks whether their faces match an enrolled gallery, determines whether the observed activity is a fall, an unauthorised intrusion, loitering, or normal movement, and immediately pushes a structured alert via a Flask REST API to a Streamlit dashboard. On a mixed indoor-outdoor dataset, the system achieved 75% precision for person detection, 95% for face identification, and 94% for activity classification while sustaining 24 frames per second at 720p resolution. These results compare favourably with single-module systems and conventional CCTV installations and represent a practical step toward more autonomous, real-time surveillance.
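The gallery check mentioned above can be illustrated with a small sketch. It assumes that each face has already been reduced to an embedding vector (for example by an ArcFace model) and that matching is done by cosine similarity against enrolled embeddings. The function names, the toy three-dimensional vectors, and the 0.6 threshold are illustrative assumptions, not values taken from RIVAS.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_gallery(probe, gallery, threshold=0.6):
    """Return (identity, score) for the best gallery match.

    The identity is None when the best score is below the threshold,
    i.e. the person is treated as unknown. The threshold is a
    hypothetical value chosen for this sketch.
    """
    best_id, best_score = None, -1.0
    for identity, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_id, best_score = identity, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score

# Toy gallery with made-up 3-dimensional embeddings; real embeddings are much longer.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(match_gallery([0.9, 0.1, 0.0], gallery))  # close to alice's embedding
print(match_gallery([0.5, 0.5, 0.7], gallery))  # not close enough to anyone
```

In practice the threshold would need to be tuned on enrolled data, since it trades false matches against missed identifications.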
Introduction
Traditional surveillance relies heavily on human operators watching multiple screens, which is inefficient because attention drops over time and reviewing footage is slow. Existing systems also lack “smart” capabilities such as automatic event detection and structured metadata. Recent advances in deep learning, such as YOLO object detectors, face recognition models, and pose estimation, now make real-time intelligent surveillance possible.
To address these shortcomings, the paper proposes RIVAS (Real-Time Intelligent Video Analytics System), which combines multiple AI modules into a single pipeline. It integrates:
YOLOv8 for real-time person detection
ArcFace/DeepFace for face recognition
MediaPipe for human pose estimation
A rule-based engine for threat detection (fall, intrusion, loitering, normal behaviour)
A dashboard and alert system for real-time monitoring and notifications
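A rule-based engine of the kind listed above can be sketched in a few lines. The example below is a minimal illustration, not the RIVAS rule set: the `Observation` fields, the torso-angle heuristic for falls, the rule ordering, and both thresholds (60 degrees, 30 seconds) are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Observation:
    """Per-person features assumed to come from the earlier modules (hypothetical schema)."""
    shoulder_mid: Tuple[float, float]  # (x, y) midpoint of the shoulders, y grows downward
    hip_mid: Tuple[float, float]       # (x, y) midpoint of the hips
    identity: Optional[str]            # None when the face matched nobody in the gallery
    in_restricted_zone: bool
    dwell_seconds: float               # how long the person has stayed in view

def torso_angle_deg(shoulder_mid, hip_mid):
    """Angle of the hip-to-shoulder line from vertical: 0 = upright, 90 = horizontal."""
    dx = abs(shoulder_mid[0] - hip_mid[0])
    dy = abs(shoulder_mid[1] - hip_mid[1])
    return math.degrees(math.atan2(dx, dy))

def classify(obs, fall_angle=60.0, loiter_seconds=30.0):
    """Apply the rules in priority order; thresholds are illustrative assumptions."""
    if torso_angle_deg(obs.shoulder_mid, obs.hip_mid) >= fall_angle:
        return "fall"
    if obs.in_restricted_zone and obs.identity is None:
        return "intrusion"
    if obs.dwell_seconds >= loiter_seconds:
        return "loitering"
    return "normal"

upright = Observation((0.5, 0.3), (0.5, 0.6), "alice", False, 5.0)
lying = Observation((0.2, 0.80), (0.5, 0.82), "alice", False, 5.0)
stranger = Observation((0.5, 0.3), (0.5, 0.6), None, True, 5.0)
lingering = Observation((0.5, 0.3), (0.5, 0.6), "alice", False, 45.0)
print([classify(o) for o in (upright, lying, stranger, lingering)])
```

Each rule here looks at a single observation, so a brief pose glitch can flip the label for one frame.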
The system processes video in eight stages: video input → preprocessing → object detection → face recognition → pose analysis → threat classification → alert generation → visualisation dashboard. Alerts are triggered via a web API and SMS when dangerous events occur.
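The staged flow above can be sketched as a chain of functions that each enrich a per-frame context. This is only a structural illustration: every stage below is a placeholder returning fixed values, the context keys and alert fields are hypothetical, and the video input and dashboard stages are omitted so that the example runs without a camera or web server.

```python
import json

def preprocess(ctx):
    # Placeholder: a real stage might resize or denoise the frame.
    ctx["preprocessed"] = True
    return ctx

def detect_persons(ctx):
    # Placeholder boxes as (x1, y1, x2, y2); a detector would supply these.
    ctx["boxes"] = [(10, 20, 110, 220), (300, 40, 380, 260)]
    return ctx

def recognise_faces(ctx):
    # Placeholder: first person is enrolled, second is unknown (None).
    ctx["identities"] = ["alice", None]
    return ctx

def analyse_pose(ctx):
    # Placeholder posture labels, one per detected person.
    ctx["postures"] = ["upright", "upright"]
    return ctx

def classify_threat(ctx):
    # Toy rule for this sketch only: an unknown identity counts as an intrusion.
    ctx["labels"] = ["normal" if ident else "intrusion" for ident in ctx["identities"]]
    return ctx

def build_alerts(ctx):
    # One structured alert per non-normal person; the field names are assumptions.
    ctx["alerts"] = [
        {"camera_id": ctx["camera_id"], "timestamp": ctx["timestamp"],
         "event": label, "box": list(box)}
        for label, box in zip(ctx["labels"], ctx["boxes"])
        if label != "normal"
    ]
    return ctx

STAGES = [preprocess, detect_persons, recognise_faces,
          analyse_pose, classify_threat, build_alerts]

def run_pipeline(frame_info, stages=STAGES):
    """Pass a copy of the frame metadata through each stage in order."""
    ctx = dict(frame_info)
    for stage in stages:
        ctx = stage(ctx)
    return ctx

result = run_pipeline({"camera_id": "cam-01", "timestamp": "2024-01-01T12:00:00Z"})
payload = json.dumps(result["alerts"])  # body that could be sent to an alert endpoint
print(payload)
```

Keeping the stages as plain functions over a shared context makes it straightforward to swap a placeholder for a real model call or to time each stage separately.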
The literature review shows how surveillance technology evolved from hand-crafted methods (Haar, HOG, optical flow) to deep learning models (CNNs, Faster R-CNN, YOLO, FaceNet). It highlights that while each technique is strong individually, most systems fail to integrate them into a unified real-time pipeline.
Conclusion
This paper presented RIVAS, an intelligent video analytics framework that integrates YOLOv8-based person detection, ArcFace-based face identification via DeepFace, and MediaPipe Holistic pose estimation within an eight-stage processing pipeline. The system is designed to mitigate three structural weaknesses of traditional CCTV deployments: reliance on human operators, the absence of automatic recognition, and the lack of real-time alerting. Experiments on a diverse four-environment dataset show 75% precision for person detection, 95% for face recognition, and 94% for activity classification, with end-to-end throughput of 24 FPS at 720p, which is 3–4× faster than an OpenCV-only baseline. Analysis of errors highlights fall detection and the loitering–intrusion boundary as the most challenging cases, mainly due to the rule engine’s lack of temporal context. Current limitations include reduced performance in low light, inability to handle masked faces, and lower throughput in very dense scenes. Future work will focus on adding night-vision support, developing mask-aware face recognition, fusing IoT sensor data for multi-modal detection, migrating to cloud-native deployments with auto-scaling, replacing the hand-crafted rule engine with learned anomaly detection, and implementing multi-camera person re-identification. With these enhancements, RIVAS can evolve into a comprehensive, production-ready surveillance platform for smart cities, industrial sites, and critical infrastructure.
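To make the point about temporal context concrete, the sketch below shows one simple way per-frame labels could be stabilised: a sliding-window majority vote. This is an illustrative assumption, not part of RIVAS and not the learned anomaly detection proposed as future work. The class name, window length, and fallback to "normal" are all hypothetical choices.

```python
from collections import Counter, deque

class TemporalSmoother:
    """Report an event only when it fills a strict majority of the last `window` frames."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, label):
        self.history.append(label)
        top_label, top_count = Counter(self.history).most_common(1)[0]
        # Compare against the full window length so that early frames,
        # before the window has filled, cannot trigger an event on their own.
        if top_count * 2 > self.history.maxlen:
            return top_label
        return "normal"

# A single spurious "fall" frame is suppressed.
noisy = TemporalSmoother(window=5)
print([noisy.update(l) for l in ["normal", "fall", "normal", "normal", "normal"]])

# A sustained "fall" is reported once it holds a majority of the window.
sustained = TemporalSmoother(window=5)
print([sustained.update(l) for l in ["fall", "fall", "fall"]])
```

The trade-off is latency: with a window of five frames, an event must persist for at least three frames before it is reported.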