This paper presents a comprehensive touchless interface that combines hand gesture recognition with voice command control for Windows operating systems and system-level interactions. Using state-of-the-art computer vision techniques, specifically MediaPipe Hands, together with speech recognition APIs, the system enables users to perform pointer movements, clicks, scrolls, window management, and multimedia control entirely through intuitive gestures and voice commands. The system's architecture is modular, adaptable, and designed for real-time operation with low latency. Extensive evaluations conducted across varied lighting conditions, backgrounds, and user profiles demonstrate high accuracy, responsiveness, and robustness, and user experience studies confirm the system's intuitiveness and practicality, underscoring its potential for accessibility applications, sterile or hands-restricted environments, public kiosk interfaces, and enhanced personal productivity. Gesture recognition is refined using dynamic thresholding and temporal filtering to minimize false positives, while voice command accuracy is enhanced through contextual keyword mapping and noise-reduction techniques. The interface further emphasizes user adaptability by offering customizable gesture mappings, multi-language support for voice inputs, and adjustable sensitivity levels based on user preference. Overall, the proposed multimodal interface establishes an effective, scalable, and user-friendly approach to natural touchless interaction on modern computing platforms.
Introduction
Human–computer interaction (HCI) is evolving toward touchless interfaces to improve accessibility, hygiene, and usability, especially in environments like hospitals and cleanrooms. Traditional physical input devices (keyboard, mouse, touchscreen) are limited in these contexts. The Smart Vision Mouse and Voice Interaction system leverages computer vision, machine learning, and speech recognition to enable natural, hands-free control of computers using gestures and voice commands.
Key Components and Features:
Gesture Recognition: Uses a standard RGB webcam with MediaPipe Hands to detect 21 hand landmarks. Gestures such as pointing, pinching, open palm, and swipes are mapped to cursor movement, clicks, scrolling, and window navigation (see the first sketch after this list).
Voice Control: Integrated speech recognition allows execution of high-level commands such as opening applications, adjusting volume, and controlling media playback (see the second sketch after this list).
Multimodal Fusion: Gesture and voice inputs operate in parallel, with each modality assigned to the tasks it suits best, reducing conflicts and yielding a seamless interaction experience.
System Architecture: Modular design with independent gesture processing, voice processing, and action execution modules ensures scalability, easy updates, and deployment on Windows platforms.
Performance Optimization: Low-latency processing, temporal smoothing, and lightweight models ensure responsive cursor tracking and command execution even on moderate hardware. GPU acceleration improves real-time performance.
User Feedback and Robustness: Visual interfaces display real-time gesture and voice status. Adaptive algorithms ensure accurate detection under varied lighting, angles, and backgrounds.
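To make the gesture pipeline concrete, the following minimal Python sketch tracks the index fingertip with MediaPipe Hands and drives the cursor via pyautogui, using an exponential moving average as a stand-in for the temporal smoothing described above. The smoothing factor and pinch threshold are illustrative values, not the tuned parameters of the actual system.

    import cv2
    import mediapipe as mp
    import pyautogui

    # Minimal sketch, not the paper's actual implementation.
    mp_hands = mp.solutions.hands
    screen_w, screen_h = pyautogui.size()

    ALPHA = 0.3          # EMA smoothing factor (assumed value)
    PINCH_THRESH = 0.05  # normalized thumb-index distance for a click (assumed)

    smooth_x, smooth_y = screen_w / 2, screen_h / 2

    cap = cv2.VideoCapture(0)
    with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.flip(frame, 1)  # mirror the image for natural pointing
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                lm = results.multi_hand_landmarks[0].landmark
                tip = lm[mp_hands.HandLandmark.INDEX_FINGER_TIP]
                thumb = lm[mp_hands.HandLandmark.THUMB_TIP]
                # Exponential moving average keeps the cursor from jittering.
                smooth_x = ALPHA * (tip.x * screen_w) + (1 - ALPHA) * smooth_x
                smooth_y = ALPHA * (tip.y * screen_h) + (1 - ALPHA) * smooth_y
                pyautogui.moveTo(smooth_x, smooth_y)
                # A thumb-index pinch stands in for a left click;
                # a real system would debounce this to avoid repeat clicks.
                if abs(tip.x - thumb.x) + abs(tip.y - thumb.y) < PINCH_THRESH:
                    pyautogui.click()
            cv2.imshow("Smart Vision Mouse (sketch)", frame)
            if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
                break
    cap.release()
    cv2.destroyAllWindows()

The mirrored frame makes on-screen motion match hand motion, and the EMA is the simplest form of the temporal filtering noted above; the full system applies dynamic thresholding on top of this.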
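The voice module can similarly be approximated with the Python speech_recognition package running on its own thread, so that listening never blocks the gesture loop, mirroring the parallel multimodal design described above. The keyword vocabulary and the use of Google's free recognizer here are assumptions for illustration, not the system's actual command set.

    import threading
    import speech_recognition as sr
    import pyautogui

    # Illustrative keyword-to-action map (assumed vocabulary).
    COMMANDS = {
        "volume up": lambda: pyautogui.press("volumeup"),
        "volume down": lambda: pyautogui.press("volumedown"),
        "play": lambda: pyautogui.press("playpause"),
        "pause": lambda: pyautogui.press("playpause"),
    }

    def listen_loop():
        recognizer = sr.Recognizer()
        mic = sr.Microphone()
        with mic as source:
            recognizer.adjust_for_ambient_noise(source)  # basic noise handling
        while True:
            with mic as source:
                audio = recognizer.listen(source, phrase_time_limit=3)
            try:
                text = recognizer.recognize_google(audio).lower()
            except (sr.UnknownValueError, sr.RequestError):
                continue  # skip unintelligible audio or failed requests
            for phrase, action in COMMANDS.items():
                if phrase in text:
                    action()

    # A daemon thread lets voice and gesture input run in parallel.
    threading.Thread(target=listen_loop, daemon=True).start()

In a complete application this thread would run alongside the gesture loop in the previous sketch, realizing the parallel fusion of the two modalities.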
Results:
The system provides accurate, low-latency, and intuitive interaction, significantly enhancing accessibility and hands-free computing. It performs optimally on modern desktops and laptops, with potential for further optimization on low-power devices. The interface allows customization of gestures, voice commands, and system parameters, offering a user-friendly and practical touchless control solution.
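Such customization could be exposed through a simple configuration structure; the sketch below shows one hypothetical layout, where every key and default value is illustrative rather than the system's actual schema.

    # Hypothetical user-configuration sketch; keys and defaults are illustrative.
    USER_CONFIG = {
        "gestures": {
            "pinch": "left_click",      # gestures can be remapped per user
            "palm_open": "pause_tracking",
            "swipe_left": "prev_window",
        },
        "voice": {
            "language": "en-US",        # multi-language support via recognizer locale
            "wake_word": "computer",
        },
        "sensitivity": {
            "cursor_smoothing": 0.3,    # EMA factor from the gesture sketch
            "pinch_threshold": 0.05,
        },
    }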
Conclusion
This research presents a robust, real-time touchless interaction system that combines hand gesture recognition with voice command integration, offering an intuitive, contact-free interface for Windows-based environments. The modular design ensures flexibility for future enhancements such as additional gestures, multi-hand interaction, and cross-platform compatibility. Experimental results validate the system's high accuracy, low latency, and user-friendly performance, demonstrating its potential for real-world deployment. Beyond convenience, this touchless interface holds significant value in domains such as assistive technology for users with mobility impairments, sterile or hygienic computing in healthcare settings, and immersive control in AR/VR environments. Overall, the research establishes a strong foundation for next-generation human–computer interaction systems that prioritize accessibility, efficiency, and user comfort.
The system’s successful integration of gesture tracking and voice commands showcases the potential of multimodal interaction frameworks to transform everyday computing, making the approach especially suitable for environments where hands-free control is essential.