Stroke is the world’s second-leading cause of death and a major cause of adult disability. Early warning signs such as facial drooping and speech impairment are often overlooked, resulting in delayed medical intervention. This paper presents a real-time multi-modal deep learning framework for early stroke risk detection using facial asymmetry and speech analysis. The proposed system uses a consumer webcam and microphone and requires neither cloud connectivity nor specialized medical hardware. Facial video frames are analyzed using a custom Convolutional Neural Network (CNN), while speech samples are processed using a Long Short-Term Memory (LSTM) model with Mel-Frequency Cepstral Coefficient (MFCC) features. The two outputs are fused through a weighted probabilistic mechanism to classify stroke risk into Low, Moderate, or High categories. The framework also integrates a React-based dashboard and a conversational AI assistant for user-friendly interaction. Experimental results show 93.2% accuracy, 91.8% sensitivity, and 94.3% specificity with an end-to-end latency below 3.5 seconds on consumer-grade hardware, demonstrating the effectiveness of the proposed approach for accessible real-world stroke risk screening.
Introduction
Stroke is a major neurological disorder and a leading cause of death. Early detection is crucial but often difficult, because subtle symptoms such as facial droop and slurred speech may be missed in emergency situations. Traditional screening methods such as FAST (Face, Arm, Speech, Time) are manual and can be unreliable, while clinical diagnosis can be delayed.
To address this, this paper proposes a real-time multi-modal deep learning system that detects stroke risk using both facial and speech data captured through a webcam and microphone. Facial inputs are processed using a CNN to detect facial asymmetry, while speech signals are analyzed using MFCC features and an LSTM model to identify speech abnormalities such as dysarthria.
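To give an intuition for the facial asymmetry signal, the following minimal sketch computes a hand-crafted left-right asymmetry score for a single grayscale frame. It is not the CNN used in this work; it is a simple baseline that assumes a centered, upright face, and the function name `mirror_asymmetry` and the toy frames are hypothetical.

```python
from typing import List

# A grayscale frame as rows of pixel intensities in [0, 1] (assumed format).
Frame = List[List[float]]


def mirror_asymmetry(frame: Frame) -> float:
    """Mean absolute difference between each pixel and its mirror partner.

    Returns 0.0 for a perfectly left-right symmetric frame; larger values
    indicate more asymmetry. Assumes the face is centered and upright.
    """
    if not frame or not frame[0]:
        raise ValueError("frame must be a non-empty 2D grid")
    width = len(frame[0])
    half = width // 2
    total = 0.0
    count = 0
    for row in frame:
        if len(row) != width:
            raise ValueError("all rows must have the same width")
        for x in range(half):
            # Compare pixel x with its partner across the vertical midline.
            total += abs(row[x] - row[width - 1 - x])
            count += 1
    return total / count if count else 0.0


# Toy frames: the second row of `drooping` differs between the two halves.
symmetric = [[0.1, 0.5, 0.5, 0.1],
             [0.2, 0.9, 0.9, 0.2]]
drooping = [[0.1, 0.5, 0.5, 0.1],
            [0.2, 0.9, 0.3, 0.8]]

print(mirror_asymmetry(symmetric))
print(mirror_asymmetry(drooping))
```

A learned model such as a CNN can pick up asymmetry that this pixel-level comparison misses, for example under head rotation or uneven lighting, which is one reason to prefer it over a fixed rule.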
The outputs from both models are combined using a weighted fusion approach to generate a final stroke risk score, which is categorized as Low, Moderate, or High. The system is designed for real-time use on consumer hardware without requiring cloud processing, which improves accessibility, privacy, and response time. It also includes a FastAPI backend, a React-based dashboard for visualization, and an AI chatbot that provides explanations and emergency guidance.
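The weighted fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact mechanism of this work: the modality weights (0.6 and 0.4), the category thresholds (0.35 and 0.65), and the names `FusionConfig` and `fuse_risk` are all hypothetical, and the weights are fixed here rather than adjusted per sample.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FusionConfig:
    # All values below are illustrative assumptions, not taken from the paper.
    face_weight: float = 0.6
    speech_weight: float = 0.4
    moderate_threshold: float = 0.35
    high_threshold: float = 0.65


def fuse_risk(p_face: float, p_speech: float,
              cfg: FusionConfig = FusionConfig()) -> Tuple[float, str]:
    """Combine two per-modality probabilities into a score and a category."""
    for name, p in (("p_face", p_face), ("p_speech", p_speech)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    total_weight = cfg.face_weight + cfg.speech_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalized weighted average keeps the fused score in [0, 1].
    score = (cfg.face_weight * p_face + cfg.speech_weight * p_speech) / total_weight
    if score >= cfg.high_threshold:
        label = "High"
    elif score >= cfg.moderate_threshold:
        label = "Moderate"
    else:
        label = "Low"
    return score, label


print(fuse_risk(0.9, 0.8))
print(fuse_risk(0.5, 0.4))
print(fuse_risk(0.1, 0.1))
```

Normalizing by the sum of the weights means the weights do not have to sum to one, so a modality can be down-weighted (for example, when audio quality is poor) without rescaling the thresholds.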
Conclusion
This study proposed a multi-modal, real-time deep learning system for early stroke warning that combines facial asymmetry analysis with speech features. The system analyzes facial imagery with a Convolutional Neural Network (CNN) and detects speech abnormalities with a Long Short-Term Memory (LSTM) network, then combines the two outputs through a dynamic weighted probabilistic fusion mechanism to improve prediction accuracy and reliability. The framework achieved 93.2% accuracy, 91.8% sensitivity, and 94.3% specificity, with end-to-end latency below 3.5 seconds on consumer hardware. A React-based dashboard and a Gemini AI chatbot further improve the user experience by providing real-time data visualization, AI-generated health information, and emergency action recommendations. These results highlight the promise of lightweight, privacy-respecting AI systems for real-world healthcare screening and provide a strong foundation for further refinement, such as multilingual support, arm motion analysis, and field validation.