Jointly modeling deep video and compositional text to bridge vision and language in a unified framework

Ran Xu, Caiming Xiong, Wei Chen, Jason J. Corso

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

171 Scopus citations

Abstract

Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on exploring the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds sentences into a continuous vector space, which preserves visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update these two models jointly. Based on these three parts, our system is able to accomplish three tasks: 1) natural language generation, 2) video retrieval and 3) language retrieval. In the experiments, the results show that our approach outperforms SVM, CRF and CCA baselines in predicting Subject-Verb-Object triplets and in natural sentence generation, and is better than CCA in the video retrieval and language retrieval tasks.
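The joint embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the deep video model and the dependency-tree language model are replaced here by plain linear maps (`video_proj`, `text_proj`), the feature vectors are made-up toy values, and every function name is hypothetical. The sketch only shows the general pattern of measuring the distance between paired embeddings in a joint space, taking one gradient step that updates both maps together, and retrieving by nearest neighbor.

```python
# Toy sketch of a joint embedding space (assumption: linear maps stand in
# for the deep video model and the compositional language model).

def project(matrix, vector):
    """Map a feature vector into the joint space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def squared_distance(a, b):
    """Squared Euclidean distance between two joint-space vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def joint_loss(video_proj, text_proj, pairs):
    """Mean squared distance between paired video and sentence embeddings."""
    total = sum(
        squared_distance(project(video_proj, v), project(text_proj, t))
        for v, t in pairs
    )
    return total / len(pairs)

def joint_step(video_proj, text_proj, pairs, lr):
    """One full-batch gradient step that updates both maps together."""
    n = len(pairs)
    new_v = [row[:] for row in video_proj]
    new_t = [row[:] for row in text_proj]
    for v, t in pairs:
        # Residual is computed with the old maps, so this is a true batch gradient.
        residual = [a - b for a, b in
                    zip(project(video_proj, v), project(text_proj, t))]
        for i, r in enumerate(residual):
            for j, x in enumerate(v):
                new_v[i][j] -= lr * 2.0 * r * x / n
            for j, x in enumerate(t):
                new_t[i][j] += lr * 2.0 * r * x / n
    return new_v, new_t

def retrieve(query_emb, candidate_embs):
    """Index of the candidate closest to the query in the joint space."""
    return min(range(len(candidate_embs)),
               key=lambda i: squared_distance(query_emb, candidate_embs[i]))

# Made-up toy data: 3-d "video" features, 2-d "sentence" features, 2-d joint space.
videos = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.2]]
sentences = [[0.9, 0.1], [0.2, 0.8]]
pairs = list(zip(videos, sentences))

video_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_proj = [[1.0, 0.0], [0.0, 1.0]]

loss_before = joint_loss(video_proj, text_proj, pairs)
video_proj_new, text_proj_new = joint_step(video_proj, text_proj, pairs, lr=0.1)
loss_after = joint_loss(video_proj_new, text_proj_new, pairs)
print("loss before/after one joint step:", loss_before, loss_after)

video_embs = [project(video_proj, v) for v in videos]
sentence_embs = [project(text_proj, t) for t in sentences]
print("video retrieved for sentence 1:", retrieve(sentence_embs[1], video_embs))
print("sentence retrieved for video 0:", retrieve(video_embs[0], sentence_embs))
```

In this sketch, video retrieval and language retrieval are the same nearest-neighbor lookup run in opposite directions, which is one plausible reading of how a shared space supports both tasks; sentence generation is not covered.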

Original language: English
Title of host publication: Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015
Pages: 2346-2352
Number of pages: 7
ISBN (Electronic): 9781577357018
State: Published - 1 Jun 2015
Event: 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015 - Austin, United States
Duration: 25 Jan 2015 – 30 Jan 2015

Publication series

Name: Proceedings of the National Conference on Artificial Intelligence
Volume: 3

Conference

Conference: 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015
Country/Territory: United States
City: Austin
Period: 25/01/15 – 30/01/15
