TY - GEN
T1 - Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation
T2 - 16th IEEE International Conference on Computer Vision, ICCV 2017
AU - Szeto, Ryan
AU - Corso, Jason J.
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/12/22
Y1 - 2017/12/22
AB - We motivate and address a human-in-the-loop variant of the monocular viewpoint estimation task in which the location and class of one semantic object keypoint are available at test time. To leverage the keypoint information, we devise a Convolutional Neural Network called Click-Here CNN (CH-CNN) that integrates the keypoint information with activations from the layers that process the image. It transforms the keypoint information into a 2D map that can be used to weight features from certain parts of the image more heavily. The weighted sum of these spatial features is combined with global image features to provide relevant information to the prediction layers. To train our network, we collect a novel dataset of 3D keypoint annotations on thousands of CAD models, and synthetically render millions of images with 2D keypoint information. On test instances from PASCAL 3D+, our model achieves a mean class accuracy of 90.7%, whereas the state-of-the-art baseline obtains only 85.7%, justifying our argument for human-in-the-loop inference.
UR - http://www.scopus.com/inward/record.url?scp=85041913888&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041913888&partnerID=8YFLogxK
DO - 10.1109/ICCV.2017.177
M3 - Conference contribution
AN - SCOPUS:85041913888
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 1604
EP - 1613
BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
Y2 - 22 October 2017 through 29 October 2017
ER -
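
The abstract describes a keypoint-conditioned fusion step: the clicked keypoint's location and class are converted into a 2D weight map over spatial image activations, and the weighted sum is combined with global image features before prediction. Below is a minimal PyTorch sketch of that idea only; the feature sizes (13x13x384 conv features, 4096-d global feature), module names, and softmax normalization are illustrative assumptions, not the published CH-CNN implementation.

```python
# Minimal sketch (not the authors' code) of the keypoint-weighted feature
# fusion described in the abstract: the clicked keypoint's location and
# class are turned into a normalized 2D weight map over spatial image
# activations, and the weighted sum is concatenated with a global image
# feature before the viewpoint prediction layers. Feature sizes, module
# names, and the softmax normalization are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointWeightedFusion(nn.Module):
    def __init__(self, num_keypoint_classes=34, spatial=13):
        super().__init__()
        # Map (flattened keypoint-location map + keypoint class) to one
        # attention score per spatial cell.
        self.map_fc = nn.Linear(spatial * spatial + num_keypoint_classes,
                                spatial * spatial)

    def forward(self, conv_feats, kp_map, kp_class):
        # conv_feats: (B, C, H, W) spatial activations from the image stream
        # kp_map:     (B, H, W)    one-hot map of the clicked keypoint location
        # kp_class:   (B, K)       one-hot keypoint class vector
        b, c, h, w = conv_feats.shape
        scores = self.map_fc(torch.cat([kp_map.flatten(1), kp_class], dim=1))
        attn = F.softmax(scores, dim=1).view(b, 1, h, w)  # 2D weight map
        # Weighted sum over spatial locations -> one C-dim attended feature.
        return (conv_feats * attn).sum(dim=(2, 3))

# Usage: fuse the attended feature with a global image feature.
fusion = KeypointWeightedFusion()
feats = torch.randn(2, 384, 13, 13)            # assumed conv feature size
kp_map = torch.zeros(2, 13, 13)
kp_map[:, 5, 7] = 1.0                          # simulated user click
kp_cls = F.one_hot(torch.tensor([3, 12]), 34).float()
attended = fusion(feats, kp_map, kp_cls)       # (2, 384)
global_feat = torch.randn(2, 4096)             # assumed fc-layer feature
combined = torch.cat([attended, global_feat], dim=1)  # to prediction layers
print(combined.shape)  # torch.Size([2, 4480])
```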