Detailed action analysis, such as action detection, localization, and segmentation, has received increasing attention in recent years. Compared to action classification, action segmentation and localization are more useful in the many practical applications that require precise spatio-temporal information about actions. They are also more challenging: determining the pixel-level location of an action requires both a strong spatial model that captures the visual appearance of the action and a temporal model that characterizes its dynamics. Most existing methods either rely on hand-crafted spatial models or can extract only short-term motion information. In this paper, we propose a 3D fully convolutional deep network that jointly exploits spatial and temporal information in a unified framework for action segmentation and localization. The proposed network is trained end-to-end to combine both kinds of information. Extensive experiments show that the proposed method outperforms state-of-the-art methods by a large margin.
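
To make the idea of a 3D fully convolutional operation concrete, the following is a minimal pure-Python sketch, not the network proposed here. It assumes a single-channel video stored as nested lists indexed `[t][y][x]`, a hand-set kernel in place of learned weights, and zero padding so that the output keeps the input's T x H x W shape. The function names `conv3d_same` and `segment` are hypothetical and chosen only for illustration.

```python
def conv3d_same(video, kernel):
    """Zero-padded 3D cross-correlation over (time, height, width).

    The output has the same T x H x W shape as the input, which is what
    lets a fully convolutional model emit one value per pixel per frame.
    Assumes odd kernel sizes along every axis.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    pt, ph, pw = kt // 2, kh // 2, kw // 2  # padding on each side
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            tt, yy, xx = t + dt - pt, y + dy - ph, x + dx - pw
                            # Positions outside the video count as zeros.
                            if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                                acc += kernel[dt][dy][dx] * video[tt][yy][xx]
                out[t][y][x] = acc
    return out


def segment(video, kernel, threshold):
    """Threshold the absolute filter response into a per-pixel binary mask."""
    response = conv3d_same(video, kernel)
    return [[[1 if abs(v) > threshold else 0 for v in row] for row in frame]
            for frame in response]


# Hand-set temporal-difference kernel of shape 3 x 1 x 1 (an assumption,
# standing in for learned weights): response = next frame - previous frame.
temporal_kernel = [[[-1.0]], [[0.0]], [[1.0]]]

# Toy 3-frame, 2 x 2 video in which one pixel lights up in the last frame.
toy_video = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]

toy_mask = segment(toy_video, temporal_kernel, threshold=0.5)
```

Because the kernel spans the time axis as well as the two spatial axes, a single operation responds to appearance and motion together, and the shape-preserving padding yields a label for every pixel of every frame. A learned network would stack many such filters with nonlinearities instead of using one fixed kernel.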