TY - JOUR
T1 - Multimedia event detection with multimodal feature fusion and temporal concept localization
AU - Oh, Sangmin
AU - McCloskey, Scott
AU - Kim, Ilseo
AU - Vahdat, Arash
AU - Cannons, Kevin J.
AU - Hajimirsadeghi, Hossein
AU - Mori, Greg
AU - Perera, A. G. Amitha
AU - Pandey, Megha
AU - Corso, Jason J.
PY - 2014/1
Y1 - 2014/1
N2 - We present a system for multimedia event detection. The system characterizes complex multimedia events based on a large array of multimodal features and classifies unseen videos by effectively fusing the diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we present a novel Latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval through its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology to improve fusion learning under limited training data conditions. A thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
KW - Classification
KW - Fusion
KW - Machine learning
KW - Multimedia
UR - http://www.scopus.com/inward/record.url?scp=84894902895&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894902895&partnerID=8YFLogxK
U2 - 10.1007/s00138-013-0525-x
DO - 10.1007/s00138-013-0525-x
M3 - Article
AN - SCOPUS:84894902895
SN - 0932-8092
VL - 25
SP - 49
EP - 69
JO - Machine Vision and Applications
JF - Machine Vision and Applications
IS - 1
ER -