TY - GEN
T1 - Watch what you just said: Image captioning with text-conditional attention
T2 - 1st International ACM Thematic Workshops, Thematic Workshops 2017
AU - Zhou, Luowei
AU - Xu, Chenliang
AU - Koch, Parker
AU - Corso, Jason J.
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/10/23
Y1 - 2017/10/23
N2 - Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, existing methods compute attention from visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention, and language model in a single network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
AB - Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, existing methods compute attention from visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention, and language model in a single network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
KW - Image captioning
KW - LSTM
KW - Neural network
KW - Multi-modal embedding
UR - http://www.scopus.com/inward/record.url?scp=85034833983&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85034833983&partnerID=8YFLogxK
U2 - 10.1145/3126686.3126717
DO - 10.1145/3126686.3126717
M3 - Conference contribution
AN - SCOPUS:85034833983
T3 - Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017
SP - 305
EP - 313
BT - Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017
Y2 - 23 October 2017 through 27 October 2017
ER -