TY - CONF
T1 - End-to-End Dense Video Captioning with Masked Transformer
AU - Zhou, Luowei
AU - Zhou, Yingbo
AU - Corso, Jason J.
AU - Socher, Richard
AU - Xiong, Caiming
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/12/14
Y1 - 2018/12/14
N2 - Dense video captioning aims to generate text descriptions for all events in an untrimmed video, which involves both detecting and describing events. Previous methods tackle this problem by building two separate models, i.e., an event proposal model and a captioning model, which are trained either separately or in alternation. This prevents the language description from directly influencing the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event over the encoded features. The masking network converts each event proposal into a differentiable mask, which ensures consistency between the proposal and the caption during training. In addition, our model employs self-attention, which enables an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where we achieve METEOR scores of 10.12 and 6.58, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85059845385&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059845385&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2018.00911
DO - 10.1109/CVPR.2018.00911
M3 - Conference contribution
AN - SCOPUS:85059845385
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 8739
EP - 8748
BT - Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
T2 - 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Y2 - 18 June 2018 through 22 June 2018
ER -