TY - JOUR
T1 - Keyword-Based Diverse Image Retrieval With Variational Multiple Instance Graph
AU - Zeng, Yawen
AU - Wang, Yiru
AU - Liao, Dongliang
AU - Li, Gongfu
AU - Huang, Weijie
AU - Xu, Jin
AU - Cao, Da
AU - Man, Hong
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - The task of cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, keyword-based queries issued by users are usually short and have broad semantics. Therefore, semantic diversity is as important as retrieval accuracy in such user-oriented services, which improves user experience. However, most typical cross-modal image retrieval methods based on single-point query embeddings inevitably result in low semantic diversity, while existing diverse retrieval approaches frequently lead to low accuracy due to a lack of cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets have verified the effectiveness of our proposed solution in both retrieval accuracy and semantic diversity.
AB - The task of cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, keyword-based queries issued by users are usually short and have broad semantics. Therefore, semantic diversity is as important as retrieval accuracy in such user-oriented services, which improves user experience. However, most typical cross-modal image retrieval methods based on single-point query embeddings inevitably result in low semantic diversity, while existing diverse retrieval approaches frequently lead to low accuracy due to a lack of cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets have verified the effectiveness of our proposed solution in both retrieval accuracy and semantic diversity.
KW - Cross-modal retrieval
KW - keyword-based image retrieval
KW - multiple instance graph
KW - variational autoencoder (VAE)
UR - http://www.scopus.com/inward/record.url?scp=85129429699&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129429699&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2022.3168431
DO - 10.1109/TNNLS.2022.3168431
M3 - Article
C2 - 35482693
AN - SCOPUS:85129429699
SN - 2162-237X
VL - 34
SP - 10528
EP - 10537
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 12
ER -