Identifying age and gender from facial images is difficult because of variations in lighting, differing facial poses, occlusions, and gender imbalance in the training data. Inaccurate predictions have practical consequences: surveillance systems become less reliable, personalised marketing becomes less effective, and social and demographic studies may produce misleading results. To overcome these difficulties, this paper uses a novel CNN architecture designed specifically for the task. The model combines convolutional and max-pooling layers with fully connected layers, and applies dropout to reduce overfitting. This approach improves the accuracy and reliability of age and gender prediction, making it suitable for security services, marketing, and human-computer interface systems. Evaluation on benchmark datasets shows a reasonable improvement, indicating that the custom CNN model is a reliable tool for practical tasks.
Introduction
This project presents a real-time computer vision system that performs face detection, recognition, age and gender estimation, emotion analysis, and scene description using a live webcam feed. The system integrates multiple machine learning models into a unified pipeline built with a Flask backend and a React.js frontend, ensuring smooth real-time video processing.
The backend uses Dlib’s HOG-based face detector and 128-dimensional face embeddings for recognition, a CNN trained on the UTKFace dataset for age and gender prediction, and the DeepFace library for emotion classification. Each detected face is processed in real time, and results such as identity, age range, gender, emotion, and confidence scores are displayed along with a natural language description of the scene. All results are stored in a SQLite database for analytics and historical tracking.
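The recognition step described above can be illustrated as a nearest-neighbour search over stored 128-dimensional embeddings. The following is a minimal sketch, not the project's actual code: the function name `match_face`, the `known` dictionary, and the use of Euclidean distance with a 0.6 threshold are assumptions made for illustration.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def match_face(
    embedding: Sequence[float],
    known: Dict[str, Sequence[float]],
    threshold: float = 0.6,  # assumed cut-off, not taken from the project
) -> Tuple[Optional[str], float]:
    """Return (name, distance) for the closest registered embedding.

    The name is None when no registered embedding lies within the threshold.
    """
    best_name: Optional[str] = None
    best_dist = math.inf
    for name, reference in known.items():
        # Euclidean distance between the query and a registered embedding
        dist = math.dist(embedding, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > threshold:
        return None, best_dist  # treat as an unknown face
    return best_name, best_dist
```

A lower threshold makes matching stricter and produces more "unknown" results; a higher one increases the risk of confusing two registered people.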
The frontend is designed to maintain smooth video playback by separating video rendering from detection processing. Face annotations are drawn asynchronously on a canvas, preventing lag even when backend processing is slower. A dashboard displays live results, analytics charts, and detection logs.
The literature review shows the evolution of face analysis from traditional methods like Viola–Jones and HOG-based detection to modern deep learning approaches such as FaceNet, ArcFace, and CNN-based age/emotion models. While advanced models offer higher accuracy, this system prioritizes efficiency and real-time performance using lightweight and CPU-friendly techniques.
The methodology involves capturing frames, detecting faces, extracting regions, running multiple models in parallel, and combining outputs into structured results. Scene understanding is generated by analyzing background color, lighting, and face proportions.
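As a rough illustration of how lighting and face proportions might be turned into a scene description, the sketch below maps a mean brightness value and the largest face box to a short sentence. The function name `describe_scene`, the brightness cut-offs (70 and 170 on a 0-255 scale), and the 15% face-area threshold are all hypothetical; the project's own rules are not given in the text.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def describe_scene(
    mean_brightness: float, faces: List[Box], frame_w: int, frame_h: int
) -> str:
    """Build a one-sentence description from brightness and face size."""
    # Classify lighting from mean pixel brightness (assumed thresholds)
    if mean_brightness < 70:
        lighting = "dim"
    elif mean_brightness < 170:
        lighting = "moderate"
    else:
        lighting = "bright"

    if not faces:
        return f"No faces visible; lighting is {lighting}."

    # Share of the frame covered by the largest face box
    largest_area = max(w * h for _, _, w, h in faces)
    ratio = largest_area / (frame_w * frame_h)
    distance = "close to" if ratio > 0.15 else "at a distance from"

    count = "1 person" if len(faces) == 1 else f"{len(faces)} people"
    return f"{count} in {lighting} lighting, {distance} the camera."
```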
Conclusion
The most significant technical contribution of the implementation is the single-canvas rendering architecture that completely decouples the video display from the detection pipeline. This insight, arrived at after observing the stutter caused by replacing the video element with an annotated snapshot on each detection cycle, transformed the user experience from choppy and jarring to smooth and professional. The coordinate mapping solution, which uses a fixed API capture resolution (640x360) and scales face coordinates using the ratio of the display canvas width to this fixed width, eliminates the class of position-offset bugs that plagued earlier implementation attempts.
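The coordinate mapping described above reduces to a single scale factor. The sketch below shows the arithmetic in Python for consistency with the backend; the actual frontend is written in React, and the name `scale_box` and the (x, y, width, height) box layout are assumptions. It also assumes the display canvas keeps the 16:9 aspect ratio of the capture, so one ratio serves both axes.

```python
from typing import Tuple

CAPTURE_WIDTH = 640   # fixed API capture resolution stated in the text
CAPTURE_HEIGHT = 360

Box = Tuple[int, int, int, int]  # (x, y, width, height) in capture pixels


def scale_box(box: Box, canvas_width: int) -> Box:
    """Map a face box from capture coordinates to display-canvas coordinates."""
    # Ratio of the display canvas width to the fixed capture width
    scale = canvas_width / CAPTURE_WIDTH
    x, y, w, h = box
    return (round(x * scale), round(y * scale), round(w * scale), round(h * scale))
```

Because the capture resolution never changes, the only runtime input is the current canvas width, which is why resizing the browser window does not shift the annotations.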
The UTKFace dataset proved well-suited for training the multi-output CNN. The filename-embedded label format (age_gender_race_datetime.jpg) made dataset loading simple, and the wide age range and demographic diversity gave the model reasonable generalisation. The systematic upward age prediction bias, while requiring an empirical calibration offset, is a well-documented characteristic of models trained on this dataset and does not reflect a fundamental flaw in the approach.
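The filename-embedded labels can be read with a few lines of code. This is a minimal sketch under the stated `age_gender_race_datetime.jpg` format; the names `UTKLabel` and `parse_utkface_name` are hypothetical, and files that do not match the format are skipped by returning `None`.

```python
import os
from typing import NamedTuple, Optional


class UTKLabel(NamedTuple):
    age: int
    gender: int  # 0 = male, 1 = female in the UTKFace convention
    race: int


def parse_utkface_name(path: str) -> Optional[UTKLabel]:
    """Extract (age, gender, race) from a UTKFace-style filename."""
    parts = os.path.basename(path).split("_")
    if len(parts) < 4:
        return None  # a field is missing, so the labels cannot be trusted
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return None  # a non-numeric label field
    return UTKLabel(age, gender, race)
```

Returning `None` instead of raising lets a loading loop filter out malformed filenames without stopping.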
The integration of DeepFace for emotion detection demonstrated the value of established library abstractions: the library handles model downloading, preprocessing, inference, and multiple backend support transparently. The numpy.float32 serialisation issue it introduced was an instructive reminder that integrating libraries from different ecosystems requires careful attention to data type compatibility at interface boundaries.
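One way to handle this kind of serialisation issue is to convert results to built-in Python types before passing them to `json.dumps`. The sketch below is an assumption about how that could be done, not the project's actual fix. It relies only on the fact that NumPy scalars and arrays both provide a `tolist()` method, so it does not need to import NumPy; the name `to_jsonable` is hypothetical.

```python
import json
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Recursively convert NumPy-style scalars and arrays to built-in types."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if hasattr(value, "tolist"):
        # NumPy scalars return a Python number, arrays return nested lists
        return value.tolist()
    return value


def dumps_result(result: Any) -> str:
    """Serialise a detection result that may contain NumPy values."""
    return json.dumps(to_jsonable(result))
```

Applying the conversion at the point where results leave the model code keeps the rest of the backend free of NumPy-specific types.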
Across all tested scenarios, the system delivered a detection rate of 1-2 detections per second on CPU hardware with smooth 30 fps video, greater than 90% gender classification accuracy on frontal faces in adequate lighting, and age predictions that fell within the displayed ±4-year range for the majority of test subjects. Face recognition was near perfect for registered individuals photographed under conditions similar to those of their registration image.
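The displayed age range can be sketched as a calibrated point estimate widened by four years on each side. The function name `age_range` and its parameters are hypothetical, and the calibration offset is left as an argument because its value is not stated in the text; the sketch assumes the offset is subtracted to correct the upward bias.

```python
from typing import Tuple


def age_range(
    predicted_age: float, offset: float = 0.0, half_width: int = 4
) -> Tuple[int, int]:
    """Return the (low, high) age range shown for a raw model prediction."""
    # Subtract the empirical offset to correct the upward prediction bias
    centre = max(0, round(predicted_age - offset))
    low = max(0, centre - half_width)  # ages cannot be negative
    return low, centre + half_width
```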
References
[1] Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1, 511-518.
[2] Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. CVPR 2005, 1, 886-893.
[3] King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10, 1755-1758.
[4] Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. CVPR 2014, 1701-1708.
[5] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. CVPR 2015, 815-823.
[6] Levi, G., & Hassner, T. (2015). Age and gender classification using convolutional neural networks. CVPR Workshops 2015, 34-42.
[7] Rothe, R., Timofte, R., & Van Gool, L. (2015). DEX: Deep Expectation of apparent age from a single image. ICCV Workshops 2015.
[8] Zhang, Z., Song, Y., & Qi, H. (2017). Age progression/regression by conditional adversarial autoencoder. CVPR 2017, 5810-5818.
[9] Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. CVPR 2019, 4690-4699.
[10] Ekman, P., & Friesen, W. V. (1978). Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press.
[11] Goodfellow, I. J., et al. (2013). Challenges in representation learning: A report on three machine learning contests. Neural Networks, 64, 59-63.
[12] Serengil, S. I., & Ozpinar, A. (2021). HyperExtended LightFace: A facial attribute analysis framework. 2021 International Conference on Engineering and Emerging Technologies (ICEET).
[13] Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available at tensorflow.org. https://www.tensorflow.org
[14] Chollet, F. (2015). Keras. GitHub Repository. https://github.com/keras-team/keras
[15] Bradski, G. (2000). The OpenCV library. Dr. Dobb's Journal of Software Tools. https://opencv.org
[16] Geitgey, A. (2018). face_recognition: The world's simplest facial recognition API for Python and the command line. https://github.com/ageitgey/face_recognition
[17] UTKFace Dataset. (2017). Large scale face dataset. AICIP Research Group, University of Tennessee. https://susanqq.github.io/UTKFace/