TY - JOUR
T1 - Predicting depression by using a novel deep learning model and video-audio-text multimodal data
AU - Li, Yifu
AU - Yang, Xueping
AU - Zhao, Meng
AU - Wang, Jiangtao
AU - Yao, Yudong
AU - Qian, Wei
AU - Qi, Shouliang
N1 - Publisher Copyright: Copyright © 2025 Li, Yang, Zhao, Wang, Yao, Qian and Qi.
PY - 2025
Y1 - 2025
AB - Objective: Depression is a prevalent mental health disorder affecting millions of people. Traditional diagnostic methods primarily rely on self-reported questionnaires and clinical interviews, which can be subjective and vary significantly between individuals. This paper introduces the Integrative Multimodal Depression Detection Network (IMDD-Net), a novel deep-learning framework designed to enhance the accuracy of depression evaluation by leveraging both local and global features from video, audio, and text cues. Methods: The IMDD-Net integrates these multimodal data streams using the Kronecker product for multimodal fusion, facilitating deep interactions between modalities. Within the audio modality, Mel Frequency Cepstrum Coefficient (MFCC) and extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features capture local and global acoustic properties, respectively. For video data, the TimeSformer network extracts both fine-grained and broad temporal features, while the text modality utilizes a pre-trained BERT model to obtain comprehensive contextual information. The IMDD-Net’s architecture effectively combines these diverse data types to provide a holistic analysis of depressive symptoms. Results: Experimental results on the AVEC 2014 dataset demonstrate that the IMDD-Net achieves state-of-the-art performance in predicting Beck Depression Inventory-II (BDI-II) scores, with a Root Mean Square Error (RMSE) of 7.55 and a Mean Absolute Error (MAE) of 5.75. A classification to identify potential depression subjects can achieve an accuracy of 0.79. Conclusion: These results underscore the robustness and precision of the IMDD-Net, highlighting the importance of integrating local and global features across multiple modalities for accurate depression prediction.
KW - deep learning
KW - depression
KW - information fusion
KW - local and global features
KW - multimedia
UR - https://www.scopus.com/pages/publications/105018835013
UR - https://www.scopus.com/pages/publications/105018835013#tab=citedBy
U2 - 10.3389/fpsyt.2025.1602650
DO - 10.3389/fpsyt.2025.1602650
M3 - Article
AN - SCOPUS:105018835013
VL - 16
JO - Frontiers in Psychiatry
JF - Frontiers in Psychiatry
M1 - 1602650
ER -