Watch what you just said: Image captioning with text-conditional attention

Luowei Zhou, Chenliang Xu, Parker Koch, Jason J. Corso

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

63 Scopus citations

Abstract

Attention mechanisms have attracted considerable interest in image captioning because of their strong performance. However, existing methods compute attention from visual content alone, and whether textual context can improve attention in image captioning remains an open question. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given the previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our method jointly learns the image embedding, text embedding, text-conditional attention, and language model in a single network trained end-to-end. We perform extensive experiments on the MS-COCO dataset. The results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of text-conditional attention in image captioning.
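As a rough illustration of the idea in the abstract, the sketch below (PyTorch) conditions a soft attention over spatial image features on an embedding of the text generated so far. All dimensions, class names, and the additive scoring function are our assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionalAttention(nn.Module):
    """Attend over spatial image features, conditioned on an embedding
    of the previously generated words (hypothetical sketch)."""
    def __init__(self, img_dim, txt_dim, attn_dim):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, attn_dim)  # project image regions
        self.txt_proj = nn.Linear(txt_dim, attn_dim)  # project text context
        self.score = nn.Linear(attn_dim, 1)           # scalar score per region

    def forward(self, img_feats, txt_embed):
        # img_feats: (batch, regions, img_dim), e.g. flattened CNN conv features
        # txt_embed: (batch, txt_dim), embedding of the words generated so far
        joint = torch.tanh(self.img_proj(img_feats)
                           + self.txt_proj(txt_embed).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)   # weights over regions
        context = (alpha * img_feats).sum(dim=1)      # text-conditioned feature
        return context, alpha

# Hypothetical usage: 196 conv regions of dim 512, text context of dim 256.
attn = TextConditionalAttention(img_dim=512, txt_dim=256, attn_dim=128)
img_feats = torch.randn(4, 196, 512)
txt_embed = torch.randn(4, 256)
context, alpha = attn(img_feats, txt_embed)

The attended feature `context` could then play the role of the guidance input to a gLSTM decoding step alongside the current word embedding, so that the embeddings, attention, and language model train jointly end-to-end, as the abstract describes.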

Original language: English
Title of host publication: Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017
Pages: 305-313
Number of pages: 9
ISBN (Electronic): 9781450354165
DOIs
State: Published - 23 Oct 2017
Event: 1st International ACM Thematic Workshops, Thematic Workshops 2017 - Mountain View, United States
Duration: 23 Oct 2017 - 27 Oct 2017

Publication series

Name: Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017

Conference

Conference: 1st International ACM Thematic Workshops, Thematic Workshops 2017
Country/Territory: United States
City: Mountain View
Period: 23/10/17 - 27/10/17

Keywords

  • Image captioning
  • LSTM
  • Neural network
  • Multi-modal embedding
