TY - GEN
T1 - Translating related words to videos and back through latent topics
AU - Das, Pradipto
AU - Srihari, Rohini K.
AU - Corso, Jason J.
PY - 2013
Y1 - 2013
AB - Documents containing video and text are becoming increasingly widespread, yet content analysis of such documents depends primarily on the text. Although automated discovery of semantically related words from text improves free-text query understanding, translating videos into text summaries facilitates better video search, particularly in the absence of accompanying text. In this paper, we propose a multimedia topic modeling framework that provides a basis for automatically discovering semantically related words from the textual metadata of multimedia documents and translating them to semantically related videos or frames from videos. The framework jointly models video and text and is flexible enough to handle different types of document features in their constituent domains, such as discrete and real-valued features from videos representing actions, objects, colors, and scenes, as well as discrete features from text. Our proposed models show a much better fit to the multimedia data in terms of held-out data log likelihoods. For a given query video, our models translate low-level vision features into bag-of-keywords summaries, which can be further translated into human-readable paragraphs using simple natural language generation techniques. We quantitatively compare the results of video-to-bag-of-words translation against a state-of-the-art baseline object recognition model from computer vision, and show that text translations from multimodal topic models vastly outperform the baseline on a multimedia dataset downloaded from the Internet.
KW - multimedia topic models
KW - video to text summarization
UR - http://www.scopus.com/inward/record.url?scp=84874280480&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84874280480&partnerID=8YFLogxK
DO - 10.1145/2433396.2433456
M3 - Conference contribution
AN - SCOPUS:84874280480
SN - 9781450318693
T3 - WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining
SP - 485
EP - 494
BT - WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining
T2 - 6th ACM International Conference on Web Search and Data Mining, WSDM 2013
Y2 - 4 February 2013 through 8 February 2013
ER -