Abstract
The global increase in mental-health conditions such as anxiety, depression, and stress highlights the need for accessible and timely psychological evaluation. Conventional evaluation remains limited due to clinician shortages and the stigma associated with seeking help. This work presents a multimodal Virtual Psychiatrist Interviewer designed to facilitate adaptive and scalable early-stage mental-health screening. The proposed framework integrates DistilBERT for linguistic interpretation, a convolutional audio-emotion model to analyze vocal cues, and V2Face-based facial-affect recognition for visual understanding. An attention-driven fusion mechanism combines text, acoustic, and facial embeddings to capture complementary behavioral signals and produce robust preliminary assessments. The system is trained and evaluated on a curated mental-health text dataset, the RAVDESS emotional speech corpus, and publicly available facial-expression datasets. Experimental results demonstrate competitive performance on anxiety, depression, and stress detection tasks, while ablation studies confirm the contribution of each modality. The findings indicate the potential of the proposed system for real-time, AI-assisted mental-health support.
Introduction
This paper presents the development of a multimodal virtual psychiatrist designed to support early mental-health screening, motivated by the global rise in anxiety, depression, and stress and by the limited availability of professional mental-health resources. Barriers such as stigma, uncertainty about treatment, and restricted access to clinicians often delay timely evaluation, highlighting the need for accessible, automated screening tools.
Recent advances in artificial intelligence—particularly in natural language processing, speech emotion recognition, and facial-affect analysis—enable the detection of subtle emotional cues embedded in everyday communication. While many existing tools rely on a single modality (text, audio, or questionnaires), this study addresses the gap by integrating linguistic, acoustic, and visual signals within a unified framework that better reflects real psychiatric assessment.
The proposed system consists of three modality-specific streams (a minimal code sketch of the text and audio streams follows the list):
Text analysis using DistilBERT to capture psychological markers in language,
Audio analysis using CNNs on MFCC features to detect emotional prosody and stress in speech, and
Facial-affect analysis using V2Face to identify micro-expressions linked to emotional distress.
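As a concrete illustration of the first two streams, the following is a minimal PyTorch sketch that extracts a DistilBERT sentence embedding and passes MFCC features through a small CNN. The layer sizes, embedding dimensions, and the AudioCNN class are illustrative assumptions rather than the exact architecture used here; the V2Face visual stream is omitted because it relies on pretrained facial-affect weights.

```python
# Illustrative sketch of the text and audio streams (not the exact reported architecture).
import librosa
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# --- Text stream: DistilBERT sentence embedding (mean-pooled last hidden state) ---
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed_text(utterance: str) -> torch.Tensor:
    """Return a 768-dim embedding for one utterance."""
    tokens = tokenizer(utterance, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = text_encoder(**tokens).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)                     # (768,)

# --- Audio stream: CNN over MFCC features (hypothetical layer sizes) ---
class AudioCNN(nn.Module):
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, emb_dim)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, 1, n_mfcc, time_frames)
        return self.proj(self.conv(mfcc).flatten(1))         # (batch, emb_dim)

def embed_audio(wav_path: str, model: AudioCNN) -> torch.Tensor:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)       # (40, time_frames)
    x = torch.tensor(mfcc, dtype=torch.float32)[None, None]  # (1, 1, 40, T)
    with torch.no_grad():
        return model(x).squeeze(0)                           # (emb_dim,)
```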
These streams are combined using a residual attention-based fusion mechanism, which dynamically weights each modality according to its reliability in a given interaction. An adaptive interviewing module adjusts follow-up questions based on the user’s emotional state, mimicking real psychiatric interviews. The final output maps emotional intensity into clinically relevant categories (Normal, Mild, Moderate, Severe), aligned with standard screening scales such as PHQ-9 and GAD-7.
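A minimal sketch of one way such a residual attention-based fusion and severity mapping could be realized is shown below, assuming the three modality embeddings have already been projected to a shared 256-dimensional space; the scorer network, the residual average path, and the four-way head are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

SEVERITY_LABELS = ["Normal", "Mild", "Moderate", "Severe"]

class ResidualAttentionFusion(nn.Module):
    """Attention-weighted fusion of three modality embeddings, with a residual
    path to their plain average and a 4-way severity classifier (illustrative)."""

    def __init__(self, dim: int = 256, n_classes: int = 4):
        super().__init__()
        # One scalar relevance score per modality embedding.
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, text: torch.Tensor, audio: torch.Tensor, face: torch.Tensor):
        # Each input: (batch, dim). Stack to (batch, 3, dim).
        stack = torch.stack([text, audio, face], dim=1)
        weights = torch.softmax(self.scorer(stack).squeeze(-1), dim=1)  # (batch, 3)
        fused = (weights.unsqueeze(-1) * stack).sum(dim=1)              # attention-weighted sum
        fused = fused + stack.mean(dim=1)                               # residual connection
        return self.classifier(fused), weights                          # severity logits + modality weights

# Usage with random stand-in embeddings projected to the shared space.
fusion = ResidualAttentionFusion(dim=256)
t, a, f = (torch.randn(2, 256) for _ in range(3))
logits, weights = fusion(t, a, f)
print(SEVERITY_LABELS[logits.argmax(dim=1)[0].item()], weights[0].tolist())
```

Returning the per-modality weights alongside the logits is what makes the fusion interpretable: the weights indicate which modality the model relied on for a given interaction.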
The system was evaluated using established text, audio (RAVDESS), and video datasets, with standardized preprocessing and training pipelines. Results show that the multimodal model consistently outperforms unimodal approaches, offering more stable and accurate predictions by compensating for weaknesses in individual modalities. Attention weights further reveal that the system adaptively prioritizes the most informative cues. Overall, the study demonstrates that multimodal, attention-driven AI systems can provide a more reliable and interpretable foundation for accessible, early mental-health screening.
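To illustrate the style of modality ablation described above, the sketch below zeroes out one modality embedding at a time before fusion and reports how often the prediction changes. The ConcatFusion stand-in model and the random embeddings are hypothetical placeholders for illustration only, not the trained system or its data.

```python
import torch
import torch.nn as nn

def ablate_modality(fusion: nn.Module, text, audio, face, drop: str):
    """Zero out one modality embedding before fusion (simple ablation probe)."""
    inputs = {"text": text, "audio": audio, "face": face}
    inputs[drop] = torch.zeros_like(inputs[drop])
    return fusion(inputs["text"], inputs["audio"], inputs["face"])

class ConcatFusion(nn.Module):
    """Trivial stand-in fusion model over concatenated embeddings."""
    def __init__(self, dim: int = 256, n_classes: int = 4):
        super().__init__()
        self.head = nn.Linear(3 * dim, n_classes)
    def forward(self, t, a, f):
        return self.head(torch.cat([t, a, f], dim=-1))

model = ConcatFusion()
t, a, f = (torch.randn(8, 256) for _ in range(3))
full = model(t, a, f).argmax(dim=1)
for m in ("text", "audio", "face"):
    dropped = ablate_modality(model, t, a, f, m).argmax(dim=1)
    changed = (dropped != full).float().mean().item()
    print(f"dropping {m}: {changed:.0%} of predictions change")
```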
Conclusion
This study presented a Virtual Psychiatrist that supports early detection of emotional distress by analyzing text, voice, and facial cues. It combines the three signals through an attention-based fusion mechanism, yielding predictions that are more accurate and stable than those obtained from any single source, and each modality (language, speech, and facial expression) contributes unique value. While the system cannot replace clinicians, it can support early mental-health assessment, particularly where specialists are scarce. Experiments showed consistent performance across settings and the ability to detect subtle emotional shifts. Although real-world testing and clinical validation are still required, the system is a promising step toward safe and effective use of AI in mental-health care.