A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching

Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso

Research output: Contribution to journal > Conference article > peer-review

238 Scopus citations

Abstract

The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have focused either on a top-down approach of generating language through combinations of object detections and language models, or on bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest-neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low-level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors, and a high-level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that the final system output has greater agreement with the human descriptions than any single level.

Original language: English
Article number: 6619184
Pages (from-to): 2634-2641
Number of pages: 8
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
State: Published - 2013
Event: 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013 - Portland, OR, United States
Duration: 23 Jun 2013 to 28 Jun 2013

Keywords

  • multimodal topic model
  • natural language
  • video to text
  • video understanding
