This paper focuses on structured, landmark-based feature extraction using MediaPipe, an open-source framework for building computer-vision inference pipelines over arbitrary sensory data such as video or audio. Hand gesture and facial expression recognition play a significant role in domains such as human-computer interaction, assistive technology, and emotion analysis. Traditional datasets rely primarily on raw images, which pose challenges in computational complexity and privacy. This paper presents an alternative approach to dataset creation: extracting structured, landmark-based representations of hand gestures and facial expressions using MediaPipe.
Introduction
Human communication includes significant non-verbal elements like gestures and facial expressions. Recognizing these cues accurately is essential for AI/ML applications such as sign language interpretation, virtual reality, and mental health monitoring. Traditional image-based datasets pose privacy and computational challenges. To address this, the paper presents a dataset generation method based on landmark extraction using MediaPipe.
Methodology:
Data Collection: Uses MediaPipe to extract 3D hand (21 points per hand) and face (468 points) landmarks.
Preprocessing: Normalization, error filtering, and confidence-based data cleaning ensure dataset quality.
Annotation: Each frame is labeled according to gesture or facial expression for supervised learning.
Storage: Extracted data is stored efficiently in CSV format, reducing data volume while preserving relevant information.
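The steps above can be sketched as follows. This is a minimal, illustrative pipeline, not the paper's exact implementation: it assumes the MediaPipe Hands layout (21 landmarks per hand, each with x, y, z), and since no camera frame is available here, synthetic coordinates stand in for the output of a live MediaPipe call. The normalization choice (wrist-relative translation plus unit scaling) and the CSV column names are assumptions for illustration.

```python
import csv
import io
import math
import random

NUM_HAND_LANDMARKS = 21  # MediaPipe Hands returns 21 (x, y, z) points per hand

# In the real pipeline these coordinates come from MediaPipe, e.g.:
#   hands = mp.solutions.hands.Hands(min_detection_confidence=0.5)
#   results = hands.process(rgb_frame)
#   landmarks = [(p.x, p.y, p.z) for p in results.multi_hand_landmarks[0].landmark]
# Synthetic values are used here so the sketch is self-contained.
random.seed(0)
landmarks = [(random.random(), random.random(), random.random() * 0.1)
             for _ in range(NUM_HAND_LANDMARKS)]

def normalize(points):
    """Translate so the wrist (landmark 0) is the origin, then scale to unit size."""
    wx, wy, wz = points[0]
    shifted = [(x - wx, y - wy, z - wz) for x, y, z in points]
    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in shifted) or 1.0
    return [(x / scale, y / scale, z / scale) for x, y, z in shifted]

def to_csv_row(label, points):
    """Flatten one annotated frame into [label, x0, y0, z0, ..., x20, y20, z20]."""
    row = [label]
    for x, y, z in points:
        row.extend([round(x, 5), round(y, 5), round(z, 5)])
    return row

# Annotate and store one frame: a header plus one labeled row per frame.
norm = normalize(landmarks)
buf = io.StringIO()
writer = csv.writer(buf)
header = ["label"] + [f"{axis}{i}"
                      for i in range(NUM_HAND_LANDMARKS)
                      for axis in ("x", "y", "z")]
writer.writerow(header)
writer.writerow(to_csv_row("thumbs_up", norm))  # "thumbs_up" is a placeholder label
print(len(header))  # 1 label column + 21 * 3 coordinates = 64
```

Storing only 63 floats per hand per frame, rather than the frame's pixels, is what yields the reduced data volume and privacy benefit noted above.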
Applications:
These landmark-based datasets are ideal for:
Gesture recognition (e.g., sign language)
Facial expression analysis (e.g., emotion detection in healthcare)
Touchless control systems
Privacy-preserving machine learning
Multimodal interaction models combining hand and facial gestures
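To illustrate the first of these applications, the following sketch consumes a landmark CSV for gesture recognition. The tiny inline dataset (shortened to two features per row) and the nearest-centroid rule are illustrative stand-ins, not the paper's method; any supervised model could be trained on the same rows.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical mini-dataset in the CSV layout described under Storage:
# a label column followed by flattened landmark coordinates.
CSV_TEXT = """label,x0,y0
open_palm,0.9,0.1
open_palm,0.8,0.2
fist,0.1,0.9
fist,0.2,0.8
"""

# Load rows and group feature vectors by gesture label.
by_label = defaultdict(list)
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    label = row.pop("label")
    by_label[label].append([float(v) for v in row.values()])

# Nearest-centroid rule: average each class, assign a query to the closest mean.
centroids = {lbl: [sum(col) / len(col) for col in zip(*vecs)]
             for lbl, vecs in by_label.items()}

def classify(query):
    return min(centroids, key=lambda lbl: math.dist(query, centroids[lbl]))

print(classify([0.85, 0.15]))  # → open_palm
```

Because the features are already normalized landmark coordinates rather than pixels, even such a simple geometric rule can separate coarse gestures; stronger models only need the same CSV interface.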
Conclusion
This paper introduces a structured, landmark-based dataset for hand gesture and facial expression recognition, offering a computationally efficient and privacy-focused alternative to traditional image-based datasets. By using the MediaPipe framework for three-dimensional landmark extraction, we enable the development of robust AI models for a wide range of applications such as sign language recognition, human-computer interaction, and emotion detection. Future work includes expanding dataset diversity, incorporating motion dynamics, and benchmarking deep learning models on the dataset.