Binary face-mask detection systems, which saw widespread adoption during the COVID-19 pandemic, are now largely insufficient for the demands of modern workplace safety monitoring. Industrial environments require real-time verification of multiple PPE categories simultaneously, not just a yes/no determination of whether a face covering is present. This paper presents a 3-class PPE compliance detection system that categorises each detected worker into one of three states: wearing the required PPE correctly, wearing PPE incorrectly (below the nose, around the chin, or loosely fitted), or not wearing PPE at all. The proposed pipeline pairs a YOLOv8n face detector with a Vision Transformer (ViT-B/16) classification backbone pretrained on ImageNet-21K and fine-tuned on a curated dataset of 4,200 annotated images across the three compliance categories. Albumentations-based augmentation, including mosaic, cutout, and histogram equalisation, improves robustness under poor lighting. After 30 training epochs using the AdamW optimiser with cosine learning rate decay, the system achieves 99.3% test accuracy with a macro-F1 of 0.991. The fine-tuned model is subsequently quantised to INT8 TFLite format and deployed on a Raspberry Pi 4, where it runs at 12 FPS, which is sufficient for practical monitoring applications. An integrated Streamlit dashboard and Telegram bot deliver real-time compliance alerts. This work demonstrates that extending face-mask detection into a full PPE compliance framework, powered by transformer-based architectures and edge-optimised deployment, is both technically feasible and operationally practical for factories, hospitals, and construction sites.
Introduction
This paper presents a PPE (personal protective equipment) compliance detection system that moves beyond traditional binary mask detection toward a more realistic and operationally useful classification framework.
Earlier COVID-era systems (2020–2022) mainly used CNNs such as MobileNetV2 or VGG to detect whether a mask was present. Although these achieved high accuracy, they fell short in real-world settings because they could not distinguish correctly worn from incorrectly worn PPE, such as masks worn below the nose or helmets not properly secured.
To address this limitation, the proposed system introduces three key improvements:
Expanded classification (3 classes): correct PPE usage, incorrect PPE usage, and no PPE.
Modern architecture upgrade: a hybrid model combining YOLOv8n for real-time face detection with a Vision Transformer (ViT-B/16) for robust classification using attention mechanisms.
Edge deployment: the system is optimised for real-time performance on a Raspberry Pi 4, using INT8 quantisation to achieve around 12 FPS without cloud dependency, along with a Streamlit dashboard and Telegram alerts.
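To make the INT8 step concrete, the following minimal sketch shows affine (scale and zero-point) quantisation of a list of float weights to signed 8-bit integers and back. It is an illustration in plain Python, not the TFLite converter's implementation; the function names `quantize_int8` and `dequantize_int8` are hypothetical, and production converters differ in detail (for example, by using per-channel scales).

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit range


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine quantisation: q = round(x / scale) + zero_point, clamped to int8."""
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(QMIN - lo / scale)
    q = [max(QMIN, min(QMAX, round(x / scale) + zero_point)) for x in values]
    return q, scale, zero_point


def dequantize_int8(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Approximate inverse: x is recovered as scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.5, 2.0]  # illustrative values only
    q, scale, zp = quantize_int8(weights)
    print(q, scale, zp)
    print(dequantize_int8(q, scale, zp))
```

Each value is stored in one byte instead of the four bytes of an FP32 weight, which is where most of the size saving from INT8 quantisation comes from. The price is a rounding error of at most about half of `scale` per value.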
Related Work Summary
The literature review shows an evolution in PPE detection methods:
Early CNN-based models achieved high accuracy but lacked real-world robustness.
YOLO-based systems improved speed and object detection but still struggled with subtle compliance cases like improper wear.
Vision Transformers and hybrid CNN-ViT models improved performance in complex, occluded, and fine-grained classification tasks.
Edge AI research shows increasing focus on running lightweight, quantized models on devices like Raspberry Pi for privacy and real-time use.
Methodology Overview
The problem is framed as a 3-class classification task on face images.
A dataset of 4,200 images is built from multiple sources, including correctly and incorrectly worn PPE examples.
Strong data augmentation (rotation, blur, cutout, mosaic, brightness shifts) is used to improve generalization.
The system pipeline:
YOLOv8 detects faces in video frames
Cropped faces are passed to ViT-B/16 for PPE classification
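The two-stage flow above can be sketched as a detector that returns boxes, a crop step, and a classifier applied to each crop. The sketch below is framework-agnostic and makes several assumptions: `detect_faces` and `classify_crop` are hypothetical callables standing in for the YOLOv8 detector and the ViT-B/16 classifier, frames are nested lists of pixels, and the 10% padding margin is an illustrative choice rather than a value from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
Frame = Sequence[Sequence[object]]  # rows of pixels; the pixel type is not inspected
LABELS = ("correct", "incorrect", "no_ppe")


def pad_and_clip(box: Box, width: int, height: int, margin: float = 0.10) -> Box:
    """Grow a box by `margin` of its size on each side, then clip it to the frame."""
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    return (max(0, x1 - dx), max(0, y1 - dy), min(width, x2 + dx), min(height, y2 + dy))


def crop(frame: Frame, box: Box) -> List[Sequence[object]]:
    """Cut the box region out of the frame (rows first, then columns)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


def run_pipeline(
    frame: Frame,
    detect_faces: Callable[[Frame], List[Box]],  # hypothetical stand-in for YOLOv8
    classify_crop: Callable[[List[Sequence[object]]], str],  # stand-in for ViT-B/16
) -> List[Tuple[Box, str]]:
    """Return one (padded box, label) pair per detected face."""
    height, width = len(frame), len(frame[0])
    results = []
    for box in detect_faces(frame):
        padded = pad_and_clip(box, width, height)
        results.append((padded, classify_crop(crop(frame, padded))))
    return results
```

Keeping the detector and classifier behind plain callables means either stage can be swapped (for instance, for a quantised model on the edge device) without touching the cropping logic.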
The model is trained with weighted loss to handle class imbalance.
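One common way to realise a weighted loss is to scale each sample's cross-entropy term by a class weight that is inversely proportional to class frequency. The sketch below assumes the "balanced" formula n_samples / (n_classes × class_count) and normalises by the sum of the weights used. The paper does not state its exact weighting scheme or per-class counts, so the counts in the usage example are purely hypothetical.

```python
import math
from typing import List, Sequence


def balanced_class_weights(counts: Sequence[int]) -> List[float]:
    """Inverse-frequency weights: n_samples / (n_classes * count_c)."""
    total, k = sum(counts), len(counts)
    return [total / (k * c) for c in counts]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted mean of per-sample losses, normalised by the sum of applied weights."""
    num = den = 0.0
    for row, t in zip(logits, targets):
        w = weights[t]
        num += -w * math.log(softmax(row)[t])
        den += w
    return num / den


if __name__ == "__main__":
    # Hypothetical counts for (correct, incorrect, no PPE); not the paper's split.
    w = balanced_class_weights([600, 300, 100])
    batch_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    print(weighted_cross_entropy(batch_logits, [0, 2], w))
```

With these weights, a mistake on the rarest class contributes six times as much to the loss as a mistake on the most common one, which pushes the model away from simply favouring the majority class.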
Key Idea
The core contribution is shifting PPE detection from a simple “mask/no mask” system to a context-aware compliance system that detects whether protective equipment is actually worn correctly, making it more suitable for real industrial and healthcare safety monitoring.
Conclusion
This paper has presented a 3-class PPE compliance detection system that extends the well-established face mask detection literature in two practically important directions: multi-class compliance classification and edge device deployment. The system pairs a YOLOv8n face detector with a ViT-B/16 classification backbone pretrained on ImageNet-21K, fine-tuned on a 4,200-image dataset spanning correct wear, incorrect wear, and absent PPE. Training with Albumentations-based augmentation and AdamW optimisation over 30 epochs achieved 99.3% test accuracy and a macro-F1 of 0.993. Quantisation to INT8 TFLite reduced model size by 75% and enabled real-time deployment on a Raspberry Pi 4 at 12 FPS, integrated with a Streamlit compliance dashboard and Telegram alert system.
The work demonstrates that transformer-based architectures are well suited to fine-grained PPE compliance classification, particularly for the difficult incorrect-wear category that binary systems cannot address. Practical limitations remain around low-light detection and edge hardware throughput; these can be addressed through supplementary illumination and accelerator hardware upgrades, respectively.
Several extensions are planned. The most immediate priority is expanding the class set beyond face masks to include helmets, safety vests, gloves, and goggles, building toward a general-purpose multi-label PPE compliance system for industrial environments. Integration with access control systems — denying entry to a restricted zone when a compliance violation is detected — would close the loop from detection to enforcement. The annotated dataset collected for this work will be released publicly to support reproducibility and to address the documented scarcity of incorrect-wear labelled training data in the field.
References
[1] A. Das, M. W. Ansari, and R. Basak, "COVID-19 face mask detection using TensorFlow, Keras and OpenCV," in Proc. IEEE INDICON, New Delhi, India, 2020.
[2] S. Hussain et al., "Face mask detection using deep convolutional neural network and MobileNetV2-based transfer learning," Wireless Communications and Mobile Computing, vol. 2022, Art. no. 1536318, 2022.
[3] N. Ghosh, B. Jana, S. Jana, and N. K. Sao, "Face mask detection exploiting CNN and MobileNetV2," Lecture Notes in Networks and Systems, vol. 738, Springer, 2024.
[4] P. Nagrath et al., "SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2," Sustainable Cities and Society, vol. 66, p. 102692, 2021.
[5] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, "Fighting against COVID-19: A novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection," Sustainable Cities and Society, vol. 65, p. 102600, 2021.
[6] A. Kanavos, O. Papadimitriou, K. Al-Hussaeni, M. Maragoudakis, and I. Karamitsos, "Real-time detection of face mask usage using convolutional neural networks," Computers, vol. 13, no. 7, p. 182, 2024.
[7] B. U. H. Sheikh and A. Zafar, "Beyond accuracy and precision: a robust deep learning framework to enhance the resilience of face mask detection models against adversarial attacks," Evolving Systems, vol. 15, pp. 1–24, 2024.
[8] M. Vukicevic et al., "A systematic review of computer vision-based personal protective equipment compliance in industry practice," Artificial Intelligence Review, Springer, 2024.
[9] N. Amangeldy et al., "Personal protective equipment detection using YOLOv8 architecture on object detection benchmark datasets: a comparative study," Cogent Engineering, vol. 11, no. 1, 2024.
[10] Y. Wei, H. Li, Y. He, et al., "Robust face mask detection in complex scenarios using YOLOv8 and context-aware convolutions," Scientific Reports, vol. 15, no. 21350, 2025.
[11] "Benchmarking lightweight YOLO object detectors for real-time hygiene compliance monitoring," PMC / MDPI, 2025.
[12] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in Proc. ICLR, 2021.
[13] H. Touvron et al., "Training data-efficient image transformers and distillation through attention," in Proc. ICML, 2021.
[14] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE ICCV, pp. 10012–10022, 2021.
[15] X. Li et al., "EfficientViT: Lightweight multi-scale attention for on-device semantic segmentation," in Proc. IEEE CVPR, 2023.
[16] "Systematic review of hybrid Vision Transformer architectures for radiological image analysis," PMC / SIIM, 2025.
[17] D. Alqahtani et al., "Benchmarking deep learning models for object detection on edge computing devices," arXiv:2409.16808, 2024.
[18] S. Saha and L. Xu, "Vision Transformers on the edge: A comprehensive survey of model compression and acceleration strategies," Neurocomputing, 2025.
[19] M. Witkowski, "Medical face mask detection dataset," Kaggle, 2020. [Online]. Available: https://www.kaggle.com/datasets/mloey1/medical-face-mask-detection-dataset
[20] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin, "Albumentations: Fast and flexible image augmentations," Information, vol. 11, no. 2, p. 125, 2020.
[21] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. ICLR, 2019.
[22] "Deploying optimized deep vision models for eyeglasses detection on low-power platforms," Electronics (MDPI), vol. 14, no. 14, 2025.