TY - GEN
T1 - Cross-view action modeling, learning, and recognition
AU - Wang, Jiang
AU - Nie, Xiaohan
AU - Xia, Yin
AU - Wu, Ying
AU - Zhu, Song-Chun
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/09/24
Y1 - 2014/09/24
AB - Existing methods for video-based action recognition are generally view-dependent, i.e., they perform recognition only from the views seen in the training data. We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i.e., recognition performed on video from an unknown and unseen view. As a compositional model, the MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling geometry, appearance, and motion variations. This paper proposes effective methods to learn the structure and parameters of the MST-AOG. Inference on the MST-AOG enables action recognition from novel views. Training the MST-AOG takes advantage of 3D human skeleton data obtained from Kinect cameras, avoiding the error-prone and time-consuming annotation of enormous numbers of multi-view video frames; recognition itself needs no 3D information and operates on 2D video input. A new Multiview Action3D dataset has been created and will be released. Extensive experiments demonstrate that this new action representation significantly improves the accuracy and robustness of cross-view action recognition on 2D videos.
UR - http://www.scopus.com/inward/record.url?scp=84911405305&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84911405305&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2014.339
DO - 10.1109/CVPR.2014.339
M3 - Conference contribution
AN - SCOPUS:84911405305
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 2649
EP - 2656
BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
PB - IEEE Computer Society
T2 - 27th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014
Y2 - 23 June 2014 through 28 June 2014
ER -