Traditional classroom monitoring relies on manual observation, which is subjective, inconsistent, and impractical for large classes. This paper presents EduVision, a real-time student attention monitoring system that uses computer vision and facial landmark analysis to quantify classroom engagement objectively. The system uses MediaPipe Face Mesh to extract 468 facial landmarks per detected face and computes a weighted attention score from three geometric parameters (yaw, pitch, and face visibility) with weights of 0.5, 0.3, and 0.2, respectively. A student is classified as attentive when the score meets or exceeds a configurable threshold of 0.65, and a snapshot is captured every five seconds to maintain a timestamped engagement log. EduVision offers two interfaces, a standalone OpenCV monitoring window and a Streamlit web dashboard, and provides a non-intrusive, cost-effective, automated alternative to manual engagement tracking that requires only a standard webcam.
Introduction
Recent advancements in Artificial Intelligence (AI) and computer vision have enabled the development of intelligent, data-driven classroom systems. Monitoring student attention during lectures is important for evaluating teaching effectiveness and improving learning outcomes, but traditional methods rely on subjective observation by teachers, which can be inconsistent and inefficient, especially in large classrooms. Existing automated attention-monitoring systems often depend on expensive hardware such as eye trackers or wearable sensors, making them unsuitable for widespread use in resource-limited educational institutions.
To overcome these limitations, the proposed system, EduVision, introduces a lightweight and non-intrusive webcam-based student attention monitoring framework. The system uses MediaPipe Face Mesh and OpenCV to perform real-time facial landmark detection and head pose estimation without requiring specialized equipment.
EduVision provides several key features:
Non-intrusive monitoring using a standard webcam without wearables or additional hardware.
Multi-metric attention scoring based on yaw, pitch, and face visibility.
Dual-interface architecture consisting of a live OpenCV monitoring window and a Streamlit web dashboard.
Automated session reporting with timestamped CSV and JSON logs for post-session analysis.
Earlier attention-monitoring systems fall broadly into two groups. Eye-tracking systems using infrared sensors achieved high accuracy but required costly hardware. Vision-based systems using RGB cameras and head pose estimation showed promising results but struggled with variable lighting, occlusions, and large classrooms.
EduVision follows a layered architecture, loosely modeled on a client-server design, that runs entirely on the local machine with no cloud dependency. The frontend includes:
A Streamlit web dashboard displaying real-time charts, analytics, and reports.
An OpenCV live HUD showing color-coded face boxes and statistics.
The backend processing pipeline is Python-based and consists of:
Face Analyzer module for extracting 468 facial landmarks using MediaPipe Face Mesh.
Attention Tracker module for session management and data logging.
HUD Renderer module for visual overlays on video frames.
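As an illustration of the Face Analyzer step, the following minimal sketch shows how per-face landmarks might be obtained with the legacy MediaPipe "solutions" Face Mesh API. The function names (`make_face_mesh`, `extract_faces`, `to_pixels`) and the `max_faces` default are hypothetical and are not taken from the EduVision source; the exact MediaPipe configuration used by the system is not stated in this paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_pixels(normalized: List[Point], width: int, height: int) -> List[Tuple[int, int]]:
    """Convert normalized (0..1) landmark coordinates to pixel coordinates."""
    return [(int(x * width), int(y * height)) for x, y in normalized]


def make_face_mesh(max_faces: int = 10):
    """Create a Face Mesh detector (hypothetical settings, not from the paper)."""
    import mediapipe as mp  # imported lazily; requires the mediapipe package

    return mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False,   # video mode: track faces across frames
        max_num_faces=max_faces,
        refine_landmarks=False,    # False keeps the 468-point mesh
    )


def extract_faces(frame_bgr, face_mesh) -> List[List[Point]]:
    """Return one list of normalized (x, y) landmarks per detected face."""
    import cv2  # imported lazily; requires the opencv-python package

    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input
    results = face_mesh.process(rgb)
    faces = results.multi_face_landmarks or []        # None when no face is found
    return [[(lm.x, lm.y) for lm in face.landmark] for face in faces]
```

Landmark coordinates are returned normalized to the frame size, so a helper such as `to_pixels` is needed before drawing overlays on the frame.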
The system captures webcam video at 1280×720 resolution and analyzes every second frame to reduce processing load. It computes three geometric attention metrics:
Yaw (horizontal head rotation, left or right)
Pitch (vertical head rotation, up or down)
Visibility (face area in the frame)
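The formulas for the three metrics are not given here. One way they might be approximated from 2D landmarks is sketched below, under stated assumptions: the landmark indices are ones commonly used with Face Mesh (nose tip 1, chin 152, forehead 10, outer eye corners 33 and 263), the proxies are dimensionless ratios rather than true angles, and the neutral pitch offset of 0.5 is a placeholder that would need calibration. None of these choices is confirmed by the EduVision implementation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]

# Commonly used Face Mesh landmark indices (assumed, not stated in the paper).
NOSE_TIP, CHIN, FOREHEAD = 1, 152, 10
LEFT_EYE_OUTER, RIGHT_EYE_OUTER = 33, 263


def yaw_proxy(pts: Sequence[Point]) -> float:
    """Horizontal nose offset from the eye midpoint, in units of eye span."""
    lx, rx = pts[LEFT_EYE_OUTER][0], pts[RIGHT_EYE_OUTER][0]
    eye_span = abs(rx - lx)
    if eye_span == 0:
        return 0.0
    return (pts[NOSE_TIP][0] - (lx + rx) / 2) / eye_span


def pitch_proxy(pts: Sequence[Point], neutral: float = 0.5) -> float:
    """Vertical nose position between forehead and chin, relative to `neutral`."""
    top, bottom = pts[FOREHEAD][1], pts[CHIN][1]
    face_height = bottom - top
    if face_height == 0:
        return 0.0
    return (pts[NOSE_TIP][1] - top) / face_height - neutral


def visibility(pts: Sequence[Point]) -> float:
    """Fraction of the frame covered by the landmark bounding box."""
    xs = [min(max(x, 0.0), 1.0) for x, _ in pts]  # clamp to the visible frame
    ys = [min(max(y, 0.0), 1.0) for _, y in pts]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))
```

A value of zero from `yaw_proxy` or `pitch_proxy` corresponds to a roughly frontal face, and larger magnitudes indicate a head turned or tilted away from the camera.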
These metrics are combined into a weighted attention score:
Score = yaw × 0.5 + pitch × 0.3 + visibility × 0.2
Students whose score meets or exceeds the 0.65 threshold are classified as “active”; all others are marked “inactive.” During a session, the system records a snapshot and the associated engagement data every five seconds.
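The scoring and classification rule can be sketched as follows. The weights and the 0.65 threshold come from the text, and the comparison uses "meets or exceeds" as stated in the abstract. The sketch assumes each metric has already been normalized to a sub-score in [0, 1], with 1 meaning fully attentive; the linear `angle_subscore` mapping and its `max_deg` parameter are hypothetical, since the normalization is not specified.

```python
WEIGHTS = {"yaw": 0.5, "pitch": 0.3, "visibility": 0.2}
DEFAULT_THRESHOLD = 0.65


def angle_subscore(angle_deg: float, max_deg: float) -> float:
    """Hypothetical mapping: 1.0 when facing forward, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(angle_deg) / max_deg)


def attention_score(yaw: float, pitch: float, visibility: float) -> float:
    """Weighted sum of three sub-scores, each expected to lie in [0, 1]."""
    for name, value in (("yaw", yaw), ("pitch", pitch), ("visibility", visibility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} sub-score must be in [0, 1], got {value}")
    return (WEIGHTS["yaw"] * yaw
            + WEIGHTS["pitch"] * pitch
            + WEIGHTS["visibility"] * visibility)


def classify(score: float, threshold: float = DEFAULT_THRESHOLD) -> str:
    """Label a face 'active' when the score meets or exceeds the threshold."""
    return "active" if score >= threshold else "inactive"
```

For example, sub-scores of 0.8, 0.6, and 0.5 give 0.40 + 0.18 + 0.10 = 0.68, which is classified as active, while 0.5, 0.5, and 1.0 give 0.60, which is classified as inactive. Because yaw carries half of the total weight, a strongly turned head cannot be offset by good pitch and visibility alone.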
The output includes:
Real-time active/inactive face indicators using green and red bounding boxes.
Session statistics such as average, peak, and minimum attention levels.
Exportable CSV and JSON reports for analysis.
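The session statistics and exports listed above could be produced along the following lines. This is a sketch only: the `Snapshot` fields, the column names, and the definition of "attention level" as the fraction of detected faces marked active in a snapshot are assumptions, since the log schema is not described here.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Snapshot:
    """One five-second log entry (hypothetical schema)."""
    timestamp: str
    active: int
    inactive: int

    @property
    def attention_level(self) -> float:
        total = self.active + self.inactive
        return self.active / total if total else 0.0  # no faces detected


def summarize(snapshots: List[Snapshot]) -> Dict[str, float]:
    """Average, peak, and minimum attention level over a session."""
    levels = [s.attention_level for s in snapshots]
    if not levels:
        return {"average": 0.0, "peak": 0.0, "minimum": 0.0}
    return {"average": mean(levels), "peak": max(levels), "minimum": min(levels)}


def to_csv(snapshots: List[Snapshot]) -> str:
    """Serialize the snapshot log as CSV text with a header row."""
    buffer = io.StringIO()
    fields = ["timestamp", "active", "inactive", "attention_level"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for s in snapshots:
        writer.writerow({**asdict(s), "attention_level": round(s.attention_level, 3)})
    return buffer.getvalue()


def to_json(snapshots: List[Snapshot]) -> str:
    """Serialize the summary and the snapshot log as a JSON document."""
    payload = {"summary": summarize(snapshots),
               "snapshots": [asdict(s) for s in snapshots]}
    return json.dumps(payload, indent=2)
```

Keeping the raw active and inactive counts in each row, rather than only the derived level, lets the statistics be recomputed later under a different definition.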
The Streamlit dashboard allows users to configure parameters such as class name, subject, student count, session duration, and attention threshold. It also provides visual analytics including trend graphs, donut charts, bar charts, and downloadable reports.
Conclusion
This paper presented EduVision, a real-time student attention monitoring system that leverages MediaPipe Face Mesh and OpenCV to objectively quantify classroom engagement through geometric head pose analysis.
The system offers a cost-effective, non-intrusive alternative to manual observation, requiring only a standard webcam and no additional hardware. A weighted three-metric scoring model combining yaw, pitch, and face visibility classifies attention in real time. With dual-interface support through OpenCV and Streamlit, automated CSV and JSON session reporting, and configurable detection parameters, EduVision shows potential for practical deployment in smart classroom environments.