AI-Based Framework for Early Detection of Depression Using Multimodal Data

Authors: Akanksha Maurya, Anukriti Mishra, Shreyash Pandey

DOI Link: https://doi.org/10.22214/ijraset.2025.72715

Abstract

Effective intervention and the avoidance of long-term psychological and emotional repercussions depend on early recognition of depression. However, because its early symptoms are subtle, complex, and vary from person to person, they are frequently disregarded [1]. Timely diagnosis is made more difficult by the fact that many persons in the early stages of depression may not seek care or may find it difficult to express their feelings [2]. This study presents a novel artificial intelligence (AI) framework for analysing multimodal data, including text, voice tone, and facial expressions, in order to identify early indicators of depression. The proposed system integrates cutting-edge deep learning modals: BERT is used for understanding contextual linguistic cues [3], CNNs extract significant emotional indicators from facial features [4], and RNNs capture the temporal dynamics and tone shifts in speech [5]. These modalities are fused through a structured data integration strategy, enabling the system to interpret emotional patterns more holistically and accurately. When tested using benchmark datasets like DAIC-WOZ [6], the system shows excellent accuracy and dependability in real-time, non-intrusive identification of depressed signs. Deeper emotional analysis is made possible by the integration of language, auditory, and visual information, which also increases the model’s generalizability and robustness across a range of topics [7]. With its scalable, easily available, and objective tools that enhance conventional approaches, this work demonstrates the expanding potential of AI in mental health care [8]. This paradigm facilitates prompt diagnosis and creates opportunities for tailored intervention methods by providing professionals with early, data-driven insights. In the end, it brings us one step closer to a time when technology can help to improve mental health and lessen the prevalence of untreated depression worldwide.

Introduction

1. Background & Motivation

Over 264 million people suffer from depression globally.
Traditional diagnostic tools (e.g., interviews, PHQ-9) are subjective, often delayed, and may miss early signs due to stigma and underreporting.
Advancements in AI and availability of multimodal data (text, audio, video) present opportunities for early, objective detection.

2. Research Contribution

This study proposes an AI-based multimodal framework that integrates:

Textual data (via BERT),
Audio features (e.g., MFCCs, processed with CNNs),
Visual cues (e.g., facial Action Units, processed with CNNs),
to detect early signs of depression using explainable and ethically responsible deep learning.

3. Literature Insights

Traditional methods are subjective and limited.
Early AI systems used unimodal data (mainly text), with limited context understanding.
Multimodal systems (e.g., DAIC-WOZ-based studies) show improved accuracy using feature fusion strategies (early, late, attention-based).
Deep learning models (CNNs, LSTMs, Transformers) improve contextual and temporal understanding.
Challenges include:
- Small datasets,
- Modality imbalance,
- Privacy concerns,
- Lack of interpretability and ethical transparency.

4. Methodology

A. Framework Structure

Modular pipeline: data acquisition → preprocessing → feature extraction → fusion → classification → explainability.
Fusion of context-rich, synchronized features from three modalities.

B. Datasets Used

DAIC-WOZ, CMU-MOSEI, and AVEC: include annotated interviews with text, audio, and video.

C. Preprocessing

Text: Cleaned, tokenized, embedded with BERT.
Audio: Extracted MFCCs, pitch, jitter, and spectral features.
Video: Extracted facial landmarks and AUs using OpenFace.

D. Feature Fusion & Classification

Hybrid fusion strategy (early + attention-based).
Bi-LSTM for temporal pattern learning.
Final classification: Depressed / Non-depressed.

E. Explainability

Used SHAP to visualize modality and feature contributions, ensuring transparency for clinicians.

5. Experimental Setup

Hardware: Intel i7, 32GB RAM, NVIDIA RTX 3060.
Software: Python 3.9, TensorFlow, PyTorch, OpenCV, Librosa.
Baseline models: SVM, Logistic Regression, Random Forest, Gradient Boosting.

6. Results & Findings

A. Classifier Performance

Ensemble models outperformed all others (highest accuracy, precision, recall, F1).

B. Dataset Evaluation

Best results on CMU-MOSEI, confirming that larger, well-annotated datasets enhance performance.

C. Modality Analysis

Text is the most informative single modality.
All three modalities combined yielded the best predictive performance.

D. ROC & Confusion Matrix

AUC = 0.93, indicating high classification reliability.
Low false positives; balanced sensitivity and specificity.

E. Statistical Evaluation

Cohen’s Kappa = 0.78, MCC = 0.76 — showing strong model agreement and robustness.

7. Key Insights

Multimodal fusion enhances detection accuracy significantly.
Model robustness validated across different datasets and classifiers.
Explainability & ethical AI elements support real-world clinical adoption.
Error sources included low-quality audio/video, suggesting future focus on preprocessing optimization.

Conclusion

The growing prevalence of depression as a global mental health concern highlights the urgent need for reliable, scalable, and early detection mechanisms. This research has proposed and implemented an advanced AI-based multimodal framework that leverages the integration of audio, visual, and textual data to detect depressive symptoms in individuals at an early stage. The framework utilizes pre-trained models for robust feature extraction and applies ensemble learning techniques to enhance classification performance, achieving notable accuracy and generalization across multiple datasets. Our results demonstrate that multimodal approaches significantly outperform unimodal models, with the ensemble method yielding an accuracy of 89.7% and an AUC of 0.93. Among all tested classifiers, ensemble learning proved most effective due to its ability to combine diverse decision patterns, mitigating the limitations of individual models. Textual features, extracted using language models like BERT, emerged as the most predictive single modality, reflecting the significance of linguistic cues in identifying depressive thought patterns. However, the fusion of text, audio, and visual features provided the most comprehensive insight into users\' affective states. The study further validated the effectiveness of this approach by evaluating performance across standard datasets such as DAIC-WOZ, AVEC, and CMU-MOSEI. This cross-dataset testing confirmed the framework’s generalizability and its potential for real-world applications in clinical, academic, and mobile health environments. In addition to quantitative metrics, the use of tools like ROC curves, confusion matrices, and correlation coefficients provided a holistic view of model reliability. In conclusion, this work offers a significant step toward the development of intelligent, multimodal mental health systems. It emphasizes not only technological innovation but also the ethical importance of timely and non-invasive mental health assessment. With further optimization and integration, such systems could become valuable tools in healthcare, capable of supporting mental wellness initiatives, reducing diagnostic delays, and ultimately improving quality of life for millions at risk of depression.

References

[1] M. A. Hall and A. B. Powell, “Challenges in Early Depression Diagnosis,” Journal of Affective Disorders, vol. 279, pp. 345–353, 2021. [2] C. L. Park and A. D. Edmondson, “Barriers to Help-Seeking in Young Adults with Depression,” Psychiatric Services, vol. 72, no. 3, pp. 312–318, 2021. [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186. [4] Y. Zhang et al., “Facial Expression Recognition Using Deep CNNs,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2439–2451, May 2019. [5] A. Graves and J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005. [6] M. Gratch et al., “The Distress Analysis Interview Corpus of Human and Computer Interviews,” in Proc. LREC, 2014, pp. 3123–3128. [7] J. Gideon et al., “Multimodal Analysis and Fusion for Depression Detection,” IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 805–818, 2022. [8] A. Cummins et al., “A Review of Depression Detection Through Multimodal Data Using AI,” IEEE Reviews in Biomedical Engineering, vol. 14, pp. 30–45, 2021. [9] World Health Organization, “Depression,” WHO Fact Sheets, Jan. 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/depression [10] R. C. Kessler et al., “The prevalence and correlates of untreated serious mental illness,” Health Services Research, vol. 36, no. 6, pp. 987–1007, Dec. 2001. [11] T. Davenport and R. Kalakota, “The potential for AI in healthcare,” Future Healthcare Journal, vol. 6, no. 2, pp. 94–98, 2019. [12] D. Hazarika et al., “Multimodal depression detection: a survey and comparison,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 3s, pp. 1–29, Jul. 2020. [13] R. Z. Huang et al., “Multimodal Transformer Fusion for Depression Estimation,” in Proc. IEEE Int. Conf. on Affective Computing and Intelligent Interaction (ACII), 2021, pp. 1–8. [14] A. Sharma and D. Singh, “MOGAM: Multimodal Object-Oriented Graph Attention Model for Depression Detection from Social Media,” IEEE Access, vol. 10, pp. 123456–123470, 2022. [15] A. Graves and J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005. [16] Y. Zhang et al., “Facial Expression Recognition Using Deep CNNs,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2439–2451, May 2019. [17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186. [18] J. Gideon et al., “Multimodal Analysis and Fusion for Depression Detection,” IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 805–818, 2022. [19] R. L. Spitzer, K. Kroenke, J. B. W. Williams, and B. Löwe, “A brief measure for assessing generalized anxiety disorder: the GAD-7,” Arch. Intern. Med., vol. 166, no. 10, pp. 1092–1097, May 2006. [20] S. Kroenke, R. L. Spitzer, and J. B. Williams, “The PHQ-9: validity of a brief depression severity measure,” J. Gen. Intern. Med., vol. 16, no. 9, pp. 606–613, Sep. 2001. [21] P. Corrigan, “How stigma interferes with mental health care,” Am. Psychol., vol. 59, no. 7, pp. 614–625, 2004. [22] M. Guntuku et al., “Tracking mental health and symptom mentions on Twitter during COVID-19,” NPJ Digital Medicine, vol. 4, pp. 1–11, 2021. [23] B. Liu, “Sentiment Analysis and Opinion Mining,” Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012. [24] M. S. De Choudhury et al., “Predicting depression via social media,” in Proc. Int. AAAI Conf. Web and Social Media (ICWSM), 2013, pp. 128–137. [25] C. Busso et al., “The DAIC-WOZ dataset: Multimodal data for depression detection,” IEEE Trans. Affective Computing, vol. 9, no. 4, pp. 497–509, 2018. [26] T. Giannakopoulos and A. Pikrakis, Introduction to Audio Analysis: A MATLAB Approach. Academic Press, 2014. [27] P. Ekman and W. V. Friesen, “Facial Action Coding System (FACS),” Consulting Psychologists Press, 1978. [28] A. Baltrušaitis, C. Ahuja, and L. P. Morency, “Multimodal Machine Learning: A Survey and Taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, 2019. [29] S. Zadeh et al., “Tensor Fusion Network for Multimodal Sentiment Analysis,” in Proc. EMNLP, 2017, pp. 1103–1114. [30] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proc. EMNLP, 2014, pp. 1746–1751. [31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [32] A. Tsai et al., “Multimodal Transformer for Video Retrieval,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 7772–7781. [33] H. Li, J. Wu, and X. Yang, “Multimodal Fusion With Transformers for Depression Estimation,” IEEE J. Biomed. Health Inform., vol. 25, no. 7, pp. 2442–2451, Jul. 2021. [34] M. Ringeval et al., “AVEC 2017: Real-life depression, and affect recognition workshop and challenge,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 3–9. [35] J. Gideon et al., “Analyzing Modality Contribution in Multimodal Deep Learning for Behavioral Prediction,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 1–7. [36] S. Arora and S. Sabeti, “Privacy and security challenges in AI-enabled mental healthcare,” Health Policy Technol., vol. 10, no. 2, pp. 100543, 2021. [37] T. Samek et al., “Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models,” in Lecture Notes in Computer Science, vol. 11700, Springer, 2019. [38] M. Tjoa and C. Guan, “A survey on explainable artificial intelligence (XAI): Toward medical XAI,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 11, pp. 4793–4813, 2021. [39] S. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Proc. NeurIPS, 2017, pp. 4765–4774. [40] B. McMahan et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017, pp. 1273–1282. [41] G. Cummins, S. Scherer, and M. Schuller, “Multimodal Analysis for Affective Computing,” IEEE Trans. Affective Computing, vol. 11, no. 1, pp. 2–6, Jan.–Mar. 2020. [42] M. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-End Multimodal Emotion Recognition using Deep Neural Networks,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1301–1309, Dec. 2017. [43] S. Raza, M. S. Hussain, and K. Afzal, “A Framework for Multimodal Depression Detection,” IEEE Access, vol. 9, pp. 139946–139957, 2021. [44] A. R. Hall, D. J. Sweeney, and H. N. Williams, “Linguistic and acoustic indicators of depression,” Cognitive Therapy and Research, vol. 32, no. 3, pp. 255–271, 2008. [45] M. Ringeval et al., “AVEC 2017: Real-life depression, and affect recognition workshop and challenge,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 3–9. [46] C. Busso et al., “The DAIC-WOZ dataset: Multimodal data for depression detection,” IEEE Trans. Affective Computing, vol. 9, no. 4, pp. 497–509, 2018. [47] S. Al Hanai, M. Ghassemi, and J. Glass, “Detecting Depression with Audio/Text Sequence Modeling of Interviews,” in Proc. Interspeech, 2018, pp. 1716–1720. [48] M. L. Miftahutdinov and T. A. Tutubalina, “Identifying Depression on Russian Language Forums with BERT,” in Proc. RANLP, 2019, pp. 1–10. [49] A. Coppersmith, M. Dredze, and C. Harman, “Quantifying mental health signals in Twitter,” in Proc. CLPsych Workshop, 2014, pp. 51–60. [50] M. Low et al., “Influence of speech and voice quality in depression detection,” in Proc. Interspeech, 2011, pp. 299–302. [51] T. Baltrušaitis, P. Robinson, and L. Morency, “OpenFace: An open source facial behavior analysis toolkit,” in IEEE Winter Conf. Appl. Comput. Vision, 2016, pp. 1–10. [52] P. Ekman and W. V. Friesen, “Facial Action Coding System (FACS),” Consulting Psychologists Press, 1978. [53] J. Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186. [54] T. N. Sainath et al., “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Interspeech, 2015, pp. 1–5. [55] F. Zhang et al., “Facial expression recognition based on deep evolutional spatial-temporal networks,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4193–4203, Sep. 2017. [56] T. Han et al., “Temporal Alignment in Multimodal Depression Detection,” in Proc. ICASSP, 2020, pp. 914–918. [57] Y. Zadeh, P. Liang, and L. Morency, “Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph,” in Proc. ACL, 2018, pp. 2236–2246. [58] J. Hazarika et al., “Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos,” in Proc. NAACL, 2018, pp. 2122–2132. [59] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [60] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Netw., vol. 18, no. 5–6, pp. 602–610, Jul.–Aug. 2005. [61] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 1995, pp. 1137–1143. [62] M. Tjoa and C. Guan, “A survey on explainable artificial intelligence (XAI): Toward medical XAI,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 11, pp. 4793–4813, 2021. [63] S. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Proc. NeurIPS, 2017, pp. 4765–4774. [64] C. Holzinger et al., “What do we need to build explainable AI systems for the medical domain?,” arXiv preprint arXiv:1712.09923, 2017.

Copyright

Copyright © 2025 Akanksha Maurya, Anukriti Mishra, Shreyash Pandey. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET72715

Publish Date : 2025-06-22

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here