TY - JOUR
T1 - Contrastive topic-enhanced network for video captioning
AU - Zeng, Yawen
AU - Wang, Yiru
AU - Liao, Dongliang
AU - Li, Gongfu
AU - Xu, Jin
AU - Man, Hong
AU - Liu, Bo
AU - Xu, Xiangmin
N1 - Publisher Copyright:
© 2023 Elsevier Ltd
PY - 2024/3/1
Y1 - 2024/3/1
N2 - In the field of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most existing works treat transcripts only as a supplementary modality, neglecting their potential for capturing high-level semantics, such as multi-modal topics. In fact, transcripts, as textual attributes derived from the video, reflect the same high-level topics as the visual content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network that models heterogeneous topics consistently; that is, an alignment module is injected in advance to learn a comprehensive latent topic space and guide caption generation. Specifically, our method consists of a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities. Extensive experiments verify the effectiveness of our solution.
AB - In the field of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most existing works treat transcripts only as a supplementary modality, neglecting their potential for capturing high-level semantics, such as multi-modal topics. In fact, transcripts, as textual attributes derived from the video, reflect the same high-level topics as the visual content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network that models heterogeneous topics consistently; that is, an alignment module is injected in advance to learn a comprehensive latent topic space and guide caption generation. Specifically, our method consists of a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities. Extensive experiments verify the effectiveness of our solution.
KW - Contrastive learning
KW - Multi-modal topic
KW - Multi-modal video understanding
KW - Video captioning
UR - http://www.scopus.com/inward/record.url?scp=85171993010&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171993010&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2023.121601
DO - 10.1016/j.eswa.2023.121601
M3 - Article
AN - SCOPUS:85171993010
SN - 0957-4174
VL - 237
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 121601
ER -