Cortex is an AI-powered, cross-platform voice assistant for desktop environments that runs natively on Windows, Linux, and macOS. The system integrates speech recognition based on OpenAI's Whisper model (accelerated with faster-whisper), a custom Natural Language Understanding (NLU) engine built on scikit-learn with multi-layered intent classification, and Text-to-Speech synthesis using the Piper TTS engine. Cortex enables hands-free control of a wide range of computer operations, including file management, system monitoring, application control, security tools, workspace management, and workflow automation. The system employs a 5-tier intent resolution pipeline (carrier phrase matching, anchor filtering, keyword boosting, template matching, and ML-based classification) to recognize commands accurately and efficiently. A PyQt6-based graphical user interface provides real-time visual feedback during voice interactions. Experimental results indicate that Cortex improves the accessibility and productivity of desktop computing for users with motor impairments and for professionals seeking a faster, hands-free workflow. This paper presents the system architecture, methodology, implementation details, and performance evaluation of the Cortex voice assistant.
Introduction
This paper presents Cortex, a fully offline, privacy-focused voice assistant designed to improve desktop productivity by overcoming the limitations of cloud-based systems such as Alexa, Siri, and Google Assistant. Unlike these assistants, Cortex runs locally on the user’s device, which preserves privacy, keeps latency low, and allows it to work without an internet connection.
The system uses a modular architecture with four layers: audio processing, intelligence (NLU), execution, and user interface. Voice input is captured and transcribed using offline speech recognition (Whisper-based), then processed through a 5-tier hybrid intent recognition system that combines rule-based methods (keywords, patterns, templates) with a machine learning fallback model. Recognized commands are executed via system-level modules that can control files, applications, and automate workflows. Responses are delivered using offline text-to-speech synthesis and a graphical desktop interface.
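The tiered resolution described above can be illustrated with a minimal sketch. It is not the Cortex implementation: the intent names, carrier phrases, templates, example utterances, and the 0.3 threshold are all hypothetical, and the role assigned to each tier is one plausible reading of its name. To keep the example dependency-free, the final tier is a token-overlap scorer standing in for the scikit-learn classifier.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical data: none of these names or phrases come from Cortex itself.
CARRIER_PHRASES = ("hey cortex", "cortex", "please", "could you", "can you")
INTENTS = {
    "open_app": {
        "anchors": {"open", "launch", "start"},
        "keywords": {"app", "application", "browser"},
        "templates": [r"(open|launch|start) (?P<target>.+)"],
    },
    "system_status": {
        "anchors": {"cpu", "memory", "battery"},
        "keywords": {"usage", "status", "system"},
        "templates": [r"(what is|show) (the )?(cpu|memory|battery) (usage|level)"],
    },
}
EXAMPLES = {  # sample utterances for the stand-in fallback tier
    "open_app": ["open the browser", "launch my editor"],
    "system_status": ["how much memory is in use", "show cpu usage"],
}

@dataclass
class Resolution:
    intent: str
    tier: str
    score: float

def strip_carrier(utterance: str) -> str:
    """Tier 1: lowercase, drop punctuation, remove leading carrier phrases."""
    text = " ".join(re.sub(r"[^\w\s]", " ", utterance.lower()).split())
    changed = True
    while changed:
        changed = False
        for phrase in CARRIER_PHRASES:
            if text.startswith(phrase + " "):
                text = text[len(phrase):].strip()
                changed = True
    return text

def keyword_score(name: str, tokens: set) -> float:
    """Tier 3: anchors count fully, supporting keywords add a smaller boost."""
    spec = INTENTS[name]
    return len(spec["anchors"] & tokens) + 0.5 * len(spec["keywords"] & tokens)

def fallback_classify(tokens: set) -> Optional[Resolution]:
    """Tier 5 stand-in: Jaccard overlap with example utterances (not a trained model)."""
    best_name, best = None, 0.0
    for name, examples in EXAMPLES.items():
        for example in examples:
            ex_tokens = set(example.split())
            score = len(tokens & ex_tokens) / len(tokens | ex_tokens)
            if score > best:
                best_name, best = name, score
    if best_name is None or best < 0.3:
        return None
    return Resolution(best_name, "fallback", best)

def resolve(utterance: str) -> Optional[Resolution]:
    text = strip_carrier(utterance)                                  # tier 1
    tokens = set(text.split())
    candidates = [n for n, s in INTENTS.items() if s["anchors"] & tokens]  # tier 2
    ranked = sorted(candidates, key=lambda n: keyword_score(n, tokens), reverse=True)
    for name in ranked:                                              # tier 4
        for pattern in INTENTS[name]["templates"]:
            if re.fullmatch(pattern, text):
                return Resolution(name, "template", 1.0)
    return fallback_classify(tokens)                                 # tier 5
```

In this sketch, `resolve("Hey Cortex, please open Firefox")` is settled by a template after the carrier phrases are stripped, while an utterance that matches no template, such as "how much memory is being used", falls through to the fallback scorer. Ordering the cheap rule-based tiers first means the learned model is consulted only when the rules cannot decide.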
Key contributions include:
A hybrid NLU pipeline combining rule-based logic and machine learning for intent detection
Fully offline speech recognition and synthesis for privacy
Cross-platform support (Windows, Linux, macOS)
A workflow automation system for multi-step voice-controlled tasks
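The workflow contribution can be pictured as a named sequence of intents dispatched in order to registered handlers. The following is a minimal sketch under that assumption only; the handler registry, the `start work` workflow, and the handler names are hypothetical, and the handlers return descriptive strings instead of touching the system.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], str]
HANDLERS: Dict[str, Handler] = {}  # hypothetical registry: intent name -> handler

def handler(name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a function as the handler for one intent."""
    def register(fn: Handler) -> Handler:
        HANDLERS[name] = fn
        return fn
    return register

@handler("open_app")
def open_app(params: dict) -> str:
    # A real handler would launch the application; this one only reports.
    return f"opened {params['target']}"

@handler("set_volume")
def set_volume(params: dict) -> str:
    return f"volume set to {params['level']}"

# A workflow is an ordered list of (intent, parameters) steps.
WORKFLOWS: Dict[str, List[Tuple[str, dict]]] = {
    "start work": [
        ("open_app", {"target": "editor"}),
        ("open_app", {"target": "browser"}),
        ("set_volume", {"level": 20}),
    ],
}

def run_workflow(name: str) -> List[str]:
    """Run every step of a workflow in order and collect the handler results."""
    results = []
    for intent, params in WORKFLOWS[name]:
        results.append(HANDLERS[intent](params))
    return results
```

Calling `run_workflow("start work")` runs the three steps in sequence. Because each step is an ordinary intent, a workflow of this shape reuses the same handlers as single voice commands.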
The literature review highlights progress in voice assistants (Siri, Alexa, Google Assistant) and offline speech technologies (Whisper, Piper TTS), but notes gaps in privacy, offline capability, and deep system integration. Cortex addresses these gaps by combining modern AI models with a fully local, extensible design.
Overall, Cortex is a privacy-preserving, offline voice assistant that enables real-time voice control of desktop systems through intelligent intent recognition and automation.
Conclusion
This paper presented Cortex, an AI-powered, cross-platform desktop voice assistant designed to deliver high-accuracy intent recognition and broad system control entirely offline. The system's 5-tier NLU pipeline achieved 94.7% intent classification accuracy, while the integration of faster-whisper and Piper TTS provided practical, low-latency interaction without cloud connectivity. Unlike other voice assistants, Cortex does not need to be opened for each interaction: it stays floating on the user's screen at a position the user chooses. Cortex demonstrates that a fully local, open-source voice assistant can achieve performance competitive with cloud-dependent commercial alternatives while preserving user privacy and enabling deep system integration across multiple operating systems. The modular architecture and JSON-based intent schema lower the barrier to extending the system with new capabilities. Future work includes the integration of a Large Language Model (LLM) for open-ended conversational responses, improved context management across multi-turn interactions, a mobile companion application, and an online model marketplace for community-contributed intent packs and voice models.