This paper presents a lightweight, real-time voice authentication system designed for IoT devices, enhancing security by accepting only authorized voice commands. It uses a MEMS-based INMP441 microphone and an ESP32-S3 microcontroller to capture audio, extracts features as Mel-Frequency Cepstral Coefficients (MFCCs), and classifies them with a compact Convolutional Neural Network (CNN). Trained with TensorFlow and deployed with TensorFlow Lite, the system supports efficient on-device inference, ideal for resource-limited edge hardware. By combining signal processing with deep learning, the solution achieves low latency, minimal power consumption, and enhanced privacy by performing all processing locally, avoiding any cloud dependency. It demonstrates robust performance in diverse acoustic conditions and is well suited to applications in smart homes, healthcare, and industrial automation. This work highlights the viability of embedded AI for secure, intuitive voice interfaces in IoT. Future improvements may include adaptive learning, multi-user support, and integration with other biometric modalities.
Introduction
The rapid growth of IoT devices in sectors like smart homes and healthcare demands secure, efficient user authentication methods. Traditional password-based methods are inadequate for IoT devices due to limited interfaces and resources. Voice biometric authentication emerges as a promising, hands-free alternative. However, most voice authentication relies on cloud processing, which introduces latency, raises privacy risks, and requires internet connectivity.
To overcome these issues, edge-based voice authentication systems process data locally on the device, enhancing privacy, reducing latency, and improving energy efficiency. Implementing such systems on resource-constrained IoT devices poses challenges in memory, computation, and power. Lightweight techniques like Mel-Frequency Cepstral Coefficients (MFCC) are used for compact and efficient audio feature extraction, while compact Convolutional Neural Networks (CNNs) enable effective real-time speaker recognition on edge hardware.
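To make the feature extraction step concrete, the following is a minimal sketch using Librosa, one of the tools listed later in this paper. The sampling rate, frame length, hop size, and coefficient count are illustrative assumptions, not parameters reported here.

import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Load and resample so every recording yields comparable features.
    audio, _ = librosa.load(wav_path, sr=sr)
    # 25 ms analysis windows with a 10 ms hop are conventional for speech.
    mfcc = librosa.feature.mfcc(
        y=audio,
        sr=sr,
        n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),
        hop_length=int(0.010 * sr),
    )
    # Time-major layout (frames, coefficients) suits a 1D CNN input.
    return mfcc.T.astype(np.float32)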
Recent advances in microcontrollers (e.g., ESP32-S3) and digital MEMS microphones (e.g., INMP441) allow fully embedded voice authentication systems that operate independently of the cloud. The system described here pairs the INMP441 microphone with the ESP32-S3 microcontroller: it extracts MFCC features and classifies voice input with a 1D CNN model trained in TensorFlow, then converted to TensorFlow Lite for efficient on-device inference.
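A minimal sketch of that conversion step is given below, assuming a trained Keras model (the network itself is sketched after the architecture description later in this introduction). The use of post-training quantization and the output file name are assumptions; the paper does not state its conversion settings.

import tensorflow as tf

def convert_to_tflite(model: tf.keras.Model, out_path: str = "voice_auth.tflite"):
    """Convert a trained Keras model to a TensorFlow Lite flat buffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Default optimizations apply post-training quantization, shrinking the
    # model for flash- and RAM-limited targets such as the ESP32-S3.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # The resulting flat buffer is what gets embedded in the device firmware.
    with open(out_path, "wb") as f:
        f.write(converter.convert())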
The paper reviews literature on MFCC feature extraction, CNN optimization for embedded systems, real-time edge deployment, and hybrid feature approaches. The proposed architecture includes offline training with labeled voice samples and real-time, privacy-preserving inference on the edge device. Software tools include Python, Librosa, TensorFlow, and TensorFlow Lite, while hardware features the ESP32-S3 board and INMP441 microphone connected via the I²S protocol.
The CNN model is designed to balance accuracy with minimal resource use, employing convolutional, pooling, and global average pooling layers, culminating in a sigmoid-activated output for binary speaker authentication. The lightweight model architecture and hardware integration enable efficient, low-latency, and secure voice authentication suitable for IoT environments, setting the stage for future enhancements such as adaptive learning and multimodal systems.
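The paper does not publish exact layer sizes, so the following Keras sketch only mirrors the structure described above (stacked 1D convolution and pooling layers, global average pooling, and a sigmoid output for binary authentication); the input shape, filter counts, and kernel sizes are illustrative assumptions.

import tensorflow as tf

N_FRAMES, N_MFCC = 98, 13  # illustrative input: roughly 1 s of 10 ms frames

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_MFCC)),
    # Stacked 1D convolutions learn temporal patterns in the MFCC sequence.
    tf.keras.layers.Conv1D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    # Global average pooling keeps the parameter count small by avoiding
    # a large flatten-plus-dense stage before the classifier.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Sigmoid output: probability that the speaker is the enrolled user.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

Global average pooling is the main lever for the resource budget here: it is typically the flatten-plus-dense stage that dominates the parameter count of small CNNs, so replacing it directly serves the memory constraints of the target microcontroller.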
Conclusion
This work presents a lightweight, privacy-preserving, and real-time voice authentication system tailored for deployment on edge IoT devices. By combining MFCC-based feature extraction with a compact 1D CNN architecture, the system achieves over 93% accuracy while maintaining a minimal computational and memory footprint. The integration with the ESP32-S3 microcontroller and INMP441 MEMS microphone allows the system to function independently of cloud resources, ensuring low latency, enhanced user privacy, and energy-efficient operation.
Unlike traditional cloud-based voice recognition systems, the proposed solution performs all processing on-device, eliminating dependence on internet connectivity and reducing potential privacy risks. This makes the system particularly suitable for smart home, healthcare, and industrial automation applications where low power, low latency, and secure operation are paramount.
The successful deployment and performance of the system demonstrate the viability of embedded deep learning for biometric authentication and pave the way for more intelligent, secure, and user-friendly interfaces in next-generation IoT ecosystems.