Conventional video surveillance setups demand round-the-clock human attention, making them impractical and unreliable at scale. This work introduces VisionSafe, a web-deployed intelligent monitoring platform that ingests recorded video footage and autonomously determines whether detected activities are SAFE or UNSAFE through a structured AI pipeline. The platform fuses frame-level object localization, skeletal pose extraction, and a trained activity classifier into a cohesive system. Built on a four-tier architecture (Presentation, Application, AI/ML, and Data), the solution delivers annotated video outputs, instant WebSocket-driven alerts, and per-user history through a React-powered interface connected to a FastAPI backend and a PostgreSQL data store. Testing on 50 surveillance videos shows approximately 91.4% activity classification accuracy and alert latency below 500 ms, establishing VisionSafe as a practically viable solution for next-generation automated public safety monitoring.
Introduction
VisionSafe is an AI-powered video safety monitoring platform developed to automate the detection of unsafe activities in surveillance footage. Although CCTV systems are widely deployed in campuses, public spaces, industries, and transit areas, they typically function as passive recording devices that require constant human monitoring. This manual process is inefficient, mentally exhausting, and prone to missing critical incidents such as falls, aggressive behavior, unauthorized intrusions, and restricted-area violations.
To address these limitations, VisionSafe integrates modern computer vision technologies into a single, deployable platform. The system allows users to upload surveillance videos through a web interface, automatically analyzes them using AI models, and generates annotated videos along with real-time dashboard notifications. Events are classified as SAFE or UNSAFE, enabling rapid identification of potential security threats without continuous human supervision.
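One way to turn per-frame SAFE/UNSAFE predictions into reportable events is to group consecutive UNSAFE frames and discard runs that are too short to matter. The sketch below illustrates that idea only; the function names, the `min_duration_s` parameter, and its 0.5 s default are assumptions for illustration and are not taken from the VisionSafe implementation.

```python
from typing import List, Tuple


def unsafe_events(
    frame_labels: List[str], fps: float, min_duration_s: float = 0.5
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans of consecutive UNSAFE frames.

    Runs shorter than min_duration_s (a hypothetical threshold) are dropped.
    """
    events: List[Tuple[float, float]] = []
    start = None
    # A trailing SAFE sentinel closes any run that reaches the last frame.
    for i, label in enumerate(frame_labels + ["SAFE"]):
        if label == "UNSAFE" and start is None:
            start = i
        elif label != "UNSAFE" and start is not None:
            if (i - start) / fps >= min_duration_s:
                events.append((start / fps, i / fps))
            start = None
    return events


def video_label(
    frame_labels: List[str], fps: float, min_duration_s: float = 0.5
) -> str:
    """Label a whole video UNSAFE if it contains at least one retained event."""
    return "UNSAFE" if unsafe_events(frame_labels, fps, min_duration_s) else "SAFE"
```

Under this assumption, a two-frame UNSAFE blip in a 10 fps clip would be ignored, while a run of eight frames (0.8 s) would be reported as an event with its start and end times.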
The platform combines three major AI components: YOLOv8 object detection for identifying people and vehicles, pose estimation using MediaPipe/OpenPose to extract human skeletal keypoints, and an activity classification module that evaluates body posture and movement patterns. Geometric features such as joint angles, limb ratios, and body symmetry are analyzed to classify activities. A rule-based refinement stage further improves prediction accuracy by validating results against predefined safety thresholds.
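The geometric features and the rule-based refinement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the keypoint names, the torso-tilt rule, and the 60-degree threshold are hypothetical, and the paper does not specify the exact features or thresholds VisionSafe uses.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at vertex b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # degenerate case: two keypoints coincide
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def midpoint(p: Point, q: Point) -> Point:
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def torso_tilt(shoulder_mid: Point, hip_mid: Point) -> float:
    """Angle in degrees between the torso line and the vertical image axis."""
    dx = abs(shoulder_mid[0] - hip_mid[0])
    dy = abs(shoulder_mid[1] - hip_mid[1])
    return math.degrees(math.atan2(dx, dy))


def refine_label(
    model_label: str,
    keypoints: Dict[str, Point],
    tilt_threshold_deg: float = 60.0,  # hypothetical safety threshold
) -> str:
    """Override the classifier output when the torso is close to horizontal."""
    shoulders = midpoint(keypoints["left_shoulder"], keypoints["right_shoulder"])
    hips = midpoint(keypoints["left_hip"], keypoints["right_hip"])
    if torso_tilt(shoulders, hips) >= tilt_threshold_deg:
        return "UNSAFE"
    return model_label
```

In this sketch an upright person (shoulders directly above hips) has a tilt near 0 degrees and keeps the classifier's label, while a near-horizontal torso is flagged as UNSAFE regardless of the classifier output.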
VisionSafe follows a four-tier architecture consisting of a React-based presentation layer, a FastAPI application layer, an AI/ML processing layer, and a PostgreSQL data layer. The system supports secure user authentication through JWT tokens, real-time alert delivery via WebSockets, structured report generation, and storage of video metadata and detection records.
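The per-user alert path between the AI/ML layer and the dashboard can be pictured as a small publish/subscribe hub. The sketch below uses only the Python standard library, with `asyncio.Queue` objects standing in for open WebSocket connections; the `AlertHub` class, its methods, and the message fields are hypothetical and do not come from the VisionSafe codebase.

```python
import asyncio
import json
from collections import defaultdict
from typing import DefaultDict, Set


class AlertHub:
    """Fan out alert messages to every active subscriber of a given user."""

    def __init__(self) -> None:
        # user_id -> set of queues, one per connected dashboard session
        self._subscribers: DefaultDict[str, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, user_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[user_id].add(queue)
        return queue

    def unsubscribe(self, user_id: str, queue: asyncio.Queue) -> None:
        self._subscribers[user_id].discard(queue)

    async def publish(
        self, user_id: str, video_id: str, label: str, timestamp_s: float
    ) -> int:
        """Queue a JSON alert for each subscriber; return how many received it."""
        message = json.dumps(
            {"video_id": video_id, "label": label, "timestamp_s": timestamp_s}
        )
        queues = list(self._subscribers.get(user_id, ()))
        for queue in queues:
            await queue.put(message)
        return len(queues)
```

In a FastAPI deployment, each WebSocket handler would presumably call `subscribe` on connect, forward queued messages to the client, and call `unsubscribe` on disconnect, so that alerts reach only the sessions of the user who uploaded the video.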
Experimental evaluation was conducted on 50 surveillance videos containing normal activities and simulated unsafe events such as falls, aggressive actions, and boundary violations. The system achieved approximately 91.4% activity classification accuracy, 88.2% person detection mAP@0.5, an average processing time of 38 ms per frame, and alert latency below 500 ms. The false positive rate was about 6.8%, mainly due to temporary body postures resembling falls.
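To put the timing figure in context, 38 ms per frame corresponds to roughly 26 frames per second of processing throughput. The helper functions below make that arithmetic explicit and show the standard confusion-matrix definitions of accuracy and false positive rate. The paper does not state which definitions it used, so these formulas are an assumption, and the 60-second, 30 fps clip is a hypothetical example rather than one of the 50 evaluation videos.

```python
def throughput_fps(ms_per_frame: float) -> float:
    """Frames processed per second for a given per-frame latency."""
    return 1000.0 / ms_per_frame


def processing_seconds(frame_count: int, ms_per_frame: float) -> float:
    """Total processing time for a clip with the given number of frames."""
    return frame_count * ms_per_frame / 1000.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of truly SAFE samples that were flagged UNSAFE."""
    return fp / (fp + tn)


print(round(throughput_fps(38.0), 1))         # 26.3
# Hypothetical clip: 60 s at 30 fps = 1800 frames.
print(round(processing_seconds(1800, 38.0), 1))  # 68.4
```

At this rate, a clip recorded at 30 fps would take slightly longer to process than its playback duration, which is consistent with the current design based on pre-recorded uploads rather than live streams.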
The results demonstrate that combining object detection, pose estimation, and activity classification provides an effective and computationally efficient approach to automated safety monitoring. The modular architecture also supports scalability and maintainability. However, current limitations include reliance on pre-recorded video uploads, lack of live camera stream integration, limited training data diversity, and dependence on manually defined safety thresholds. Future work will focus on live surveillance support, improved generalization across environments, adaptive learning-based thresholds, and enhanced multi-user scalability.
Conclusion
This paper described VisionSafe, a self-contained intelligent surveillance platform engineered to remove human bottlenecks from safety monitoring workflows. By uniting YOLOv8 object detection, skeleton-based pose extraction, and a supervised activity classifier inside a layered web architecture (React dashboard, FastAPI service, PostgreSQL store, and WebSocket alert channel), the system transforms uploaded footage into actionable safety reports without manual review.
The principal contributions are threefold: (1) a unified end-to-end pipeline spanning raw video intake through annotated output and live alert dispatch; (2) a pose-geometry classifier augmented by domain rules that reached approximately 91.4% activity classification accuracy; and (3) a validated prototype demonstrated across indoor and outdoor recording conditions. Together these outputs confirm that the platform meets its stated goal of delivering reliable automated safety analysis with minimal reliance on human operators.
Planned improvements span several dimensions: connecting to live RTSP feeds for true real-time monitoring; introducing a unified control panel for multi-camera deployments; adding role-differentiated access permissions suited to enterprise environments; extending reach through a companion mobile application; building automated retraining pipelines that incorporate newly labelled incidents; porting the inference engine to edge hardware for on-premise low-latency operation; and linking event triggers to external emergency-response systems. At scale, orchestrating distributed workloads via Kubernetes is identified as the target deployment model.