TY - GEN
T1 - LAVS: A Lightweight Audio-Visual Saliency Prediction Model
T2 - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
AU - Zhu, Dandan
AU - Zhao, Defang
AU - Min, Xiongkuo
AU - Han, Tian
AU - Zhou, Qiangqiang
AU - Yu, Shaobo
AU - Chen, Yongqing
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Audio information is essential for guiding human attention and visual perception, as has been verified by numerous psychological studies. However, the audio modality has been largely neglected in visual attention modeling; most current visual attention models rely solely on visual information. Moreover, existing high-performing visual attention models depend on deep convolutional neural networks (CNNs), benefiting from their extraordinary feature learning ability but incurring high computational cost. To this end, we propose a novel lightweight audio-visual saliency (LAVS) model to efficiently address the problem of fixation prediction in videos. To the best of our knowledge, our model constitutes the first attempt to exploit a lightweight network that combines visual and audio cues for saliency estimation in videos. Specifically, the proposed model consists of four modules: a spatial-temporal visual saliency estimation module, an audio feature extraction module, a sound source localization module, and an audio-visual saliency fusion module. Extensive experiments across datasets validate the effectiveness and real-time performance of the proposed LAVS model, which outperforms other state-of-the-art methods.
KW - Audio-visual saliency
KW - deep canonical correlation analysis
KW - lightweight model
KW - saliency fusion
KW - visual attention
UR - http://www.scopus.com/inward/record.url?scp=85121110324&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121110324&partnerID=8YFLogxK
U2 - 10.1109/ICME51207.2021.9428415
DO - 10.1109/ICME51207.2021.9428415
M3 - Conference contribution
AN - SCOPUS:85121110324
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Y2 - 5 July 2021 through 9 July 2021
ER -