Predicting depression by using a novel deep learning model and video-audio-text multimodal data

  • Yifu Li
  • Xueping Yang
  • Meng Zhao
  • Jiangtao Wang
  • Yudong Yao
  • Wei Qian
  • Shouliang Qi

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Depression is a prevalent mental health disorder affecting millions of people. Traditional diagnostic methods rely primarily on self-reported questionnaires and clinical interviews, which can be subjective and vary significantly between individuals. This paper introduces the Integrative Multimodal Depression Detection Network (IMDD-Net), a novel deep-learning framework designed to improve the accuracy of depression evaluation by leveraging both local and global features from video, audio, and text cues.

Methods: The IMDD-Net fuses the three modalities with the Kronecker product, which lets features from different modalities interact directly. In the audio modality, Mel-frequency cepstral coefficient (MFCC) features capture local acoustic properties and extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features capture global ones. For video, the TimeSformer network extracts both fine-grained and broad temporal features, while the text modality uses a pre-trained BERT model to obtain contextual information. The architecture combines these data types to provide a holistic analysis of depressive symptoms.

Results: On the AVEC 2014 dataset, the IMDD-Net achieves state-of-the-art performance in predicting Beck Depression Inventory-II (BDI-II) scores, with a Root Mean Square Error (RMSE) of 7.55 and a Mean Absolute Error (MAE) of 5.75. When used to classify subjects as potentially depressed, it achieves an accuracy of 0.79.

Conclusion: These results underscore the robustness and precision of the IMDD-Net and highlight the importance of integrating local and global features across multiple modalities for accurate depression prediction.
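The abstract names Kronecker-product fusion and the RMSE and MAE metrics but does not say where in the network the fusion is applied or what the feature dimensions are. The following is only a minimal sketch of the general idea in plain Python, not the authors' implementation. The function names (`kron`, `fuse`, `rmse`, `mae`) and the toy feature vectors are assumptions introduced for illustration.

```python
import math
from functools import reduce


def kron(u, v):
    """Kronecker product of two 1-D vectors: all products u[i] * v[j],
    ordered with the index of u varying slowest."""
    return [a * b for a in u for b in v]


def fuse(*features):
    """Fuse any number of modality vectors by chaining Kronecker products.
    The result has length equal to the product of the input lengths."""
    return reduce(kron, features)


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length score lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up embeddings (hypothetical values, not from the paper).
video = [1.0, 2.0]
audio = [0.5, -1.0, 3.0]
text = [2.0, 0.0]

fused = fuse(video, audio, text)
print(len(fused))  # 2 * 3 * 2 = 12 pairwise-interaction terms
```

Because every element of one vector multiplies every element of the others, the fused vector contains all cross-modal interaction terms, at the cost of a dimension that grows multiplicatively with each modality. In practice the per-modality embeddings are usually projected to a small size before a fusion of this kind.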

Original language: English
Article number: 1602650
Journal: Frontiers in Psychiatry
Volume: 16
State: Published - 2025

Keywords

  • deep learning
  • depression
  • information fusion
  • local and global features
  • multimedia
