Modern web applications built on dynamic frameworks such as React and Next.js present significant challenges for traditional UI automation tools, which rely on brittle, selector-based approaches that frequently fail under interface changes. This paper introduces ZeroClick, a natural language-driven autonomous UI controller that bridges human intent and browser execution through an agentic architecture.
ZeroClick integrates a Tauri-based desktop interface, a FastAPI/LangGraph backend, and a Model Context Protocol (MCP)-enabled Playwright execution layer to enable end-to-end automation via high-level user prompts. At its core, the system employs a structured execution cycle termed the “Golden Loop”, which enforces iterative navigation, DOM state synchronization, action execution using ephemeral element references, and state invalidation to ensure robustness in dynamic environments.
Unlike conventional automation frameworks, ZeroClick incorporates context-aware reasoning through a ReAct-based agent, enabling adaptive interaction with complex single-page applications while mitigating issues such as stale references and inconsistent state propagation. Experimental evaluation across multi-step workflows—including form completion and dynamic navigation tasks—demonstrates improved reliability and reduced execution latency compared to traditional approaches.
The results highlight the potential of combining large language models with structured browser control protocols to achieve resilient, self-healing UI automation. ZeroClick establishes a scalable foundation for next-generation intelligent assistants capable of performing complex web interactions directly from natural language instructions.
Introduction
ZeroClick is an AI-powered autonomous web automation system designed to eliminate the manual effort involved in interacting with complex web applications. Scripts written for traditional automation tools such as Selenium and Playwright typically depend on fragile locators (e.g., CSS selectors, element IDs, and XPath expressions), which makes them vulnerable to UI changes and unreliable in modern Single Page Applications (SPAs) such as those built with React. ZeroClick addresses these limitations by enabling users to perform complex web tasks using natural language instructions rather than code.
The system combines a Tauri-based desktop application, a FastAPI/LangGraph backend, and a Playwright-based Model Context Protocol (MCP) server to create a bridge between human intent and browser execution. Unlike conventional automation frameworks, ZeroClick uses intelligent DOM observation, semantic understanding, and multi-agent reasoning to adapt to changing interfaces and execute workflows autonomously.
The architecture follows a modular design with four layers: a React/Tauri frontend for user interaction, a FastAPI backend for agent orchestration, a local SQLite database for session storage, and an MCP server that controls browser actions. At its core is a LangGraph ReAct agent that follows a structured “Golden Loop” workflow: navigate to a page, capture a DOM snapshot, reason about the next action, execute it using element reference IDs, invalidate the old state, and re-synchronize before proceeding. This process ensures reliable interactions even on highly dynamic websites.
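The Golden Loop steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ZeroClick's implementation: `FakeBrowser`, `Snapshot`, `Action`, `golden_loop`, and `scripted_policy` are hypothetical names, the browser is an in-memory stand-in for the MCP server, and a scripted policy stands in for the LLM's reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snapshot:
    """A DOM snapshot: maps short-lived element reference IDs to labels."""
    version: int
    refs: dict  # ref id -> accessible label


@dataclass
class Action:
    kind: str  # "click" or "fill" in this sketch
    ref: str
    text: str = ""


class FakeBrowser:
    """Hypothetical in-memory stand-in for the MCP-controlled browser."""

    def __init__(self) -> None:
        self.version = 0
        self.log = []

    def navigate(self, url: str) -> None:
        self.version += 1
        self.log.append(f"navigate {url}")

    def snapshot(self) -> Snapshot:
        # Reference IDs embed the page version, so old ones cannot be reused.
        v = self.version
        return Snapshot(v, {f"e{v}-1": "Search box", f"e{v}-2": "Submit"})

    def execute(self, action: Action, snap: Snapshot) -> None:
        if snap.version != self.version or action.ref not in snap.refs:
            raise RuntimeError("stale element reference")
        self.log.append(f"{action.kind} {snap.refs[action.ref]}")
        self.version += 1  # assume every action may re-render the page


Policy = Callable[[Snapshot, int], Optional[Action]]


def golden_loop(browser: FakeBrowser, url: str, decide: Policy,
                max_steps: int = 10) -> int:
    """Navigate, then repeat: snapshot, reason, act, invalidate."""
    browser.navigate(url)
    for step in range(max_steps):
        snap = browser.snapshot()      # synchronize with the live DOM
        action = decide(snap, step)    # reasoning step (an LLM in practice)
        if action is None:             # the policy considers the goal reached
            return step
        browser.execute(action, snap)  # act using refs from this snapshot only
        del snap                       # invalidate: never reuse old refs
    return max_steps


def scripted_policy(snap: Snapshot, step: int) -> Optional[Action]:
    """Stand-in for the agent: look elements up by label, not by stored ID."""
    by_label = {label: ref for ref, label in snap.refs.items()}
    if step == 0:
        return Action("fill", by_label["Search box"], "laptops")
    if step == 1:
        return Action("click", by_label["Submit"])
    return None
```

The design point the sketch tries to capture is that reference IDs are only valid for the snapshot that produced them, so the loop takes a fresh snapshot before every action instead of caching element handles.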
A key innovation is its ability to handle challenges associated with modern web frameworks. By recognizing React components, triggering proper synthetic events, and continuously updating its understanding of the DOM, ZeroClick avoids the stale references and broken scripts common in traditional automation tools. It also incorporates self-healing mechanisms that automatically recover from errors such as missing elements or closed browser targets by generating new DOM snapshots and revising execution plans.
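One way to picture the self-healing behaviour is a retry wrapper that takes a new snapshot whenever an action fails on a stale reference. The sketch below is illustrative only: `StaleReferenceError`, `act_with_recovery`, and `FlakyPage` are hypothetical names, and the real system may revise its whole plan rather than simply retrying the same step.

```python
class StaleReferenceError(Exception):
    """Raised when an element reference no longer matches the live page."""


def act_with_recovery(take_snapshot, perform, label, max_attempts=3):
    """Resolve `label` to a fresh reference and act on it.

    On a stale reference or a missing element, take a new snapshot and retry.
    Returns the attempt number that succeeded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        refs = take_snapshot()  # fresh mapping: label -> reference ID
        ref = refs.get(label)
        if ref is None:
            last_error = LookupError(f"no element labelled {label!r}")
            continue
        try:
            perform(ref)
            return attempt
        except StaleReferenceError as exc:
            last_error = exc  # page changed under us: re-sync and retry
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


class FlakyPage:
    """Hypothetical page whose first reference goes stale before it is used."""

    def __init__(self):
        self.generation = 0
        self.clicked = None

    def take_snapshot(self):
        self.generation += 1
        return {"Add to cart": f"ref-{self.generation}"}

    def perform(self, ref):
        if ref == "ref-1":  # simulate a re-render after the first snapshot
            raise StaleReferenceError(ref)
        self.clicked = ref
```

With `page = FlakyPage()`, calling `act_with_recovery(page.take_snapshot, page.perform, "Add to cart")` fails once on the stale reference, re-snapshots, and succeeds on the second attempt. Bounding the number of attempts keeps a permanently missing element from looping forever.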
The system was evaluated using metrics such as goal completion rate, error recovery reliability, execution latency, state consistency, and user transparency. Results showed that the Golden Loop architecture effectively eliminated stale-reference errors, while the self-healing capabilities allowed successful recovery from dynamic page changes without human intervention. Real-time streaming of screenshots and the agent’s reasoning process enhanced transparency and user trust.
Conclusion
ZeroClick successfully demonstrates that AI-driven browser automation can bridge the gap between complex human intent and technical UI execution through a resilient, agentic architecture. The project validates the core hypothesis that a Model Context Protocol (MCP)-based multi-agent system can effectively navigate modern, dynamic web environments by replacing brittle, selector-based scripts with high-level semantic reasoning. By integrating LangGraph's stateful orchestration with Playwright's robust automation, the system provides a comprehensive and transparent platform for autonomous web task completion.
Key Achievements
• Resilient Automation: Successfully eliminated the "maintenance bottleneck" associated with traditional automation by utilizing a self-healing "Golden Loop" that synchronizes the agent with the live DOM tree before every interaction.
• React Compatibility: Addressed state synchronization issues in modern SPAs by implementing a specialized action executor that ensures state updates propagate correctly through native property setters and synthetic events.
• Standardized Integration: Utilized the Model Context Protocol (MCP) to provide a standardized, vendor-neutral bridge between the LLM and browser tools, reducing hallucinations and improving tool execution reliability.
• Transparent Execution: Developed a real-time streaming UI using Tauri and WebSockets, allowing users to trace the agent's thought process and monitor live browser screenshots with minimal latency.
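The React Compatibility item mentions native property setters and synthetic events. A widely used pattern for this is to call the `value` setter from the element's prototype (bypassing the setter React installs on the element itself) and then dispatch bubbling `input` and `change` events. The Python sketch below only builds the JavaScript that such an executor might inject; `build_react_fill_script` is a hypothetical helper, and the CSS selector is an assumption made to keep the example short, since ZeroClick itself addresses elements by reference ID.

```python
import json


def build_react_fill_script(css_selector: str, value: str) -> str:
    """Return a JavaScript expression that fills a React-controlled field.

    json.dumps is used to embed both strings as safely quoted JS literals.
    """
    sel = json.dumps(css_selector)
    val = json.dumps(value)
    return (
        "(() => {\n"
        f"  const el = document.querySelector({sel});\n"
        "  if (!el) return false;\n"
        "  // Use the prototype's setter, not the one overridden on the element.\n"
        "  const proto = Object.getPrototypeOf(el);\n"
        "  const setter = Object.getOwnPropertyDescriptor(proto, 'value').set;\n"
        f"  setter.call(el, {val});\n"
        "  // Bubbling events let the framework's handlers observe the change.\n"
        "  el.dispatchEvent(new Event('input', { bubbles: true }));\n"
        "  el.dispatchEvent(new Event('change', { bubbles: true }));\n"
        "  return true;\n"
        "})()"
    )
```

A string built this way could be passed to a page-evaluation call such as Playwright's `page.evaluate`; whether ZeroClick's executor works exactly like this is not stated in the text.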
System Performance Validation
The system consistently achieved high functional correctness across diverse web domains, including e-commerce and search engines. Average goal completion times ranged between 8 and 10 seconds per complex multi-step task, demonstrating significant efficiency improvements over manual processes. The platform handled dynamic UI shifts and recovered from execution errors through an automated re-sync pipeline, achieving a performance level that significantly narrows the gap between AI agents and human users in realistic web environments.