TY - GEN
T1 - Modular multi-modal attention network for Alzheimer's disease detection using patient audio and language data
AU - Wang, Ning
AU - Cao, Yupeng
AU - Hao, Shuai
AU - Shao, Zongru
AU - Subbalakshmi, K. P.
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - In this work, we propose a modular multi-modal architecture to automatically detect Alzheimer's disease using the dataset provided in the ADReSSo challenge. Both acoustic and text-based features are used in this architecture. Since the dataset provides only audio samples of controls and patients, we use the Google Cloud Speech-to-Text API to automatically transcribe the audio files and extract text-based features. Several kinds of audio features are extracted using standard packages. The proposed approach consists of four networks: a C-Attention-Acoustic network (for acoustic features only), a C-Attention-FT network (for linguistic features only), a C-Attention-Embedding network (for language embeddings and acoustic embeddings), and a unified network (which uses all of these features). The architecture combines attention networks and a convolutional neural network (C-Attention network) to process these features. Experimental results show that the C-Attention-Unified network with linguistic features and X-Vector embeddings achieves the best accuracy of 80.28% and F1 score of 0.825 on the test dataset.
AB - In this work, we propose a modular multi-modal architecture to automatically detect Alzheimer's disease using the dataset provided in the ADReSSo challenge. Both acoustic and text-based features are used in this architecture. Since the dataset provides only audio samples of controls and patients, we use the Google Cloud Speech-to-Text API to automatically transcribe the audio files and extract text-based features. Several kinds of audio features are extracted using standard packages. The proposed approach consists of four networks: a C-Attention-Acoustic network (for acoustic features only), a C-Attention-FT network (for linguistic features only), a C-Attention-Embedding network (for language embeddings and acoustic embeddings), and a unified network (which uses all of these features). The architecture combines attention networks and a convolutional neural network (C-Attention network) to process these features. Experimental results show that the C-Attention-Unified network with linguistic features and X-Vector embeddings achieves the best accuracy of 80.28% and F1 score of 0.825 on the test dataset.
KW - Acoustic feature
KW - Alzheimer's disease
KW - CNN-attention network
KW - Linguistic feature
KW - Multi-modal approach
UR - http://www.scopus.com/inward/record.url?scp=85119294157&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119294157&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-2024
DO - 10.21437/Interspeech.2021-2024
M3 - Conference contribution
AN - SCOPUS:85119294157
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4196
EP - 4200
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -