Abstract
In this paper, we present a temporal capsule network architecture to encode motion in videos as an instantiation parameter. The extracted motion is used to perform motion-compensated error concealment. We modify the original architecture and use a carefully curated dataset to enable the training of capsules spatially and temporally. First, we add the temporal dimension by taking co-located “patches” from three consecutive frames obtained from standard video sequences to form input data “cubes.” Second, the network is designed with an initial feature extraction layer that operates on all three dimensions to generate spatiotemporal features. Additionally, we implement the PrimaryCaps module with a recurrent layer, instead of a conventional convolutional layer, to extract short-term motion-related temporal dependencies and encode them as activation vectors in the capsule output. Finally, the capsule output is combined with the most-recent past frame and passed through a fully connected reconstruction network to perform motion-compensated error concealment. We study the effectiveness of temporal capsules by comparing the proposed model with architectures that do not include capsules. Although the quality of the reconstruction shows room for improvement, we successfully demonstrate that capsule-based architectures can be designed to operate in the temporal dimension to encode motion-related attributes as instantiation parameters. The accuracy of motion estimation is evaluated by comparing both the reconstructed frame outputs and the corresponding optical flow estimates with ground truth data.
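The abstract outlines a pipeline of Conv3D feature extraction over three-frame input cubes, a recurrent (ConvLSTM-style) PrimaryCaps stage, and a fully connected reconstruction network conditioned on the most-recent past frame. The sketch below illustrates that pipeline in TensorFlow/Keras; all layer widths, kernel sizes, the capsule dimension, and the 32×32 patch size are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of the described pipeline (assumed hyperparameters):
# Conv3D spatiotemporal features -> ConvLSTM PrimaryCaps -> squashed capsule
# vectors -> concatenation with the most-recent past frame -> FC reconstruction.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 3, 32, 32, 1   # three co-located grayscale patches form one "cube"
CAPS_DIM = 8                # assumed length of each capsule instantiation vector

def squash(v, axis=-1, eps=1e-7):
    """Capsule squash non-linearity (Sabour et al.)."""
    sq_norm = tf.reduce_sum(tf.square(v), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / tf.sqrt(sq_norm + eps)

cube = layers.Input(shape=(T, H, W, C), name="input_cube")
prev_frame = layers.Input(shape=(H, W, C), name="most_recent_past_frame")

# Initial feature extraction operating on all three dimensions.
x = layers.Conv3D(64, kernel_size=(3, 5, 5), padding="same", activation="relu")(cube)

# PrimaryCaps realized with a recurrent layer instead of a plain convolution,
# so short-term motion dependencies end up in the capsule activations.
x = layers.ConvLSTM2D(64, kernel_size=5, strides=2, padding="same",
                      return_sequences=False)(x)

# Group channels into capsule vectors and apply the squash non-linearity.
caps = layers.Reshape((-1, CAPS_DIM))(x)
caps = layers.Lambda(squash, name="primary_caps")(caps)

# Combine capsule output with the most-recent past frame, then reconstruct
# the concealed patch with a fully connected decoder.
feat = layers.Concatenate()([layers.Flatten()(caps), layers.Flatten()(prev_frame)])
feat = layers.Dense(512, activation="relu")(feat)
recon = layers.Dense(H * W * C, activation="sigmoid")(feat)
recon = layers.Reshape((H, W, C), name="concealed_patch")(recon)

model = Model(inputs=[cube, prev_frame], outputs=recon)
```

In a training setup consistent with the abstract, such a model would be fit by minimizing a reconstruction loss between the predicted patch and the ground-truth current frame patch; the exact loss and training schedule used in the paper are not specified here.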
| Original language | English (US) |
|---|---|
| Pages (from-to) | 1369-1377 |
| Number of pages | 9 |
| Journal | Signal, Image and Video Processing |
| Volume | 14 |
| Issue number | 7 |
| DOIs | |
| State | Published - Oct 1 2020 |
Keywords
- Capsule networks
- Conv3D
- ConvLSTM
- Error concealment
- Motion estimation
ASJC Scopus subject areas
- Signal Processing
- Electrical and Electronic Engineering