TY - GEN
T1 - Dynamic Visual Sequence Prediction with Motion Flow Networks
AU - Ji, Dinghuang
AU - Wei, Zheng
AU - Dunn, Enrique
AU - Frahm, Jan-Michael
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/5/3
Y1 - 2018/5/3
AB - We target the problem of synthesizing future motion sequences from a temporally ordered set of input images. Previous methods tackled this problem in two ways: predicting the future image pixel values, and predicting the dense time-space trajectories of pixels. To this end, generative encoder-decoder networks have been widely adopted in both kinds of methods. However, pixel prediction with these networks has been shown to suffer from blurry outputs, since images are generated from scratch and there is no explicit enforcement of visual coherency. Alternatively, crisp details can be achieved by transferring pixels from the input image through dense trajectory predictions, but this process requires pre-computed motion fields for training, which limits the learning ability of the neural networks. To synthesize realistic movement of objects under weak supervision (i.e., without pre-computed dense motion fields), we propose two novel network structures. Our first network encodes the input images as feature maps and uses a decoder network to predict the future pixel correspondences for a series of subsequent time steps. The attained correspondence fields are then used to synthesize future views. Our second network focuses on human-centered capture by augmenting our framework with sparse pose estimates [30] to guide our dense correspondence prediction. Compared with state-of-the-art pixel-generating and dense-trajectory-predicting networks, our model performs better on synthetic as well as real-world human body movement sequences.
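N1 - The following is a minimal illustrative sketch, not the authors' code, of the core idea the abstract describes: an encoder-decoder predicts a dense correspondence (flow) field for a future time step, and the future view is synthesized by warping the input frame with that field rather than generating pixels from scratch. It assumes a PyTorch-style implementation; all module names, layer sizes, and the warp helper are hypothetical, and the paper's actual architecture details differ.

     import torch
     import torch.nn as nn
     import torch.nn.functional as F

     class FlowPredictor(nn.Module):
         """Hypothetical encoder-decoder: frame -> dense correspondence field."""
         def __init__(self):
             super().__init__()
             # Encoder: input frame -> feature maps.
             self.encoder = nn.Sequential(
                 nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
             )
             # Decoder: feature maps -> 2-channel pixel-offset (flow) field.
             self.decoder = nn.Sequential(
                 nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                 nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
             )

         def forward(self, frame):
             # (B, 2, H, W): per-pixel (dx, dy) offsets for the next time step.
             return self.decoder(self.encoder(frame))

     def warp(frame, flow):
         """Synthesize the future view by bilinearly sampling the input frame
         at locations displaced by the predicted correspondence field."""
         b, _, h, w = frame.shape
         ys, xs = torch.meshgrid(
             torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
         base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
         # Convert pixel offsets to normalized [-1, 1] grid coordinates.
         norm_flow = torch.stack(
             (flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)), dim=-1)
         return F.grid_sample(frame, base + norm_flow, align_corners=True)

     model = FlowPredictor()
     frame = torch.rand(1, 3, 64, 64)          # one input frame
     next_frame = warp(frame, model(frame))    # synthesized future view

Because pixels are transferred from the input rather than regenerated, this warping step is what preserves crisp detail; under weak supervision the flow decoder would be trained through a reconstruction loss on the warped output, with no pre-computed motion fields required.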
UR - http://www.scopus.com/inward/record.url?scp=85051038819&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85051038819&partnerID=8YFLogxK
U2 - 10.1109/WACV.2018.00119
DO - 10.1109/WACV.2018.00119
M3 - Conference contribution
AN - SCOPUS:85051038819
T3 - Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018
SP - 1038
EP - 1046
BT - Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018
T2 - 18th IEEE Winter Conference on Applications of Computer Vision, WACV 2018
Y2 - 12 March 2018 through 15 March 2018
ER -