TY - JOUR
T1 - Learning Compositional Sparse Bimodal Models
AU - Kumar, Suren
AU - Dhiman, Vikas
AU - Koch, Parker A.
AU - Corso, Jason J.
N1 - Publisher Copyright:
© 1979-2012 IEEE.
PY - 2018/5/1
Y1 - 2018/5/1
N2 - Various perceptual domains have underlying compositional semantics that are rarely captured in current models. We suspect this is because directly learning the compositional structure has evaded these models. Yet, the compositional structure of a given domain can be grounded in a separate domain thereby simplifying its learning. To that end, we propose a new approach to modeling bimodal perceptual domains that explicitly relates distinct projections across each modality and then jointly learns a bimodal sparse representation. The resulting model enables compositionality across these distinct projections and hence can generalize to unobserved percepts spanned by this compositional basis. For example, our model can be trained on red triangles and blue squares; yet, implicitly will also have learned red squares and blue triangles. The structure of the projections and hence the compositional basis is learned automatically; no assumption is made on the ordering of the compositional elements in either modality. Although our modeling paradigm is general, we explicitly focus on a tabletop building-blocks setting. To test our model, we have acquired a new bimodal dataset comprising images and spoken utterances of colored shapes (blocks) in the tabletop setting. Our experiments demonstrate the benefits of explicitly leveraging compositionality in both quantitative and human evaluation studies.
KW - Multimodal learning
KW - artificial intelligence
KW - compositional learning
KW - human-robot interaction
KW - symbol grounding
KW - tabletop robotics
UR - http://www.scopus.com/inward/record.url?scp=85044846932&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85044846932&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2017.2693987
DO - 10.1109/TPAMI.2017.2693987
M3 - Article
C2 - 28422653
AN - SCOPUS:85044846932
SN - 0162-8828
VL - 40
SP - 1032
EP - 1044
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 5
ER -