We present a novel approach to event detection in video based on temporal sequence modeling. Exploiting temporal information lies at the core of many approaches to video analysis (i.e., action, activity, and event recognition). Unlike previous work, which models temporal structure at the semantic event level, we model temporal dependencies in the data at the sub-event level, without using event annotations. This frees the temporal model from reliance on ground-truth event annotations and addresses several limitations of previous work on temporal modeling. Based on this idea, we represent a video as a sequence of visual words learnt from the video, and apply the Sequence Memoizer to capture long-range temporal dependencies in the visual sequence. This data-driven temporal model is then integrated with event classification to jointly perform segmentation and classification of events in a video. We demonstrate the efficacy of our approach on two challenging datasets for visual recognition.
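
As a rough illustration of the representation described above, the following minimal sketch maps per-frame features to a sequence of visual words and fits a simple temporal model on that sequence without any event labels. It is not the method of this paper: the Sequence Memoizer is replaced here by an additively smoothed bigram model, which captures only first-order dependencies, and the codebook, the toy features, and the names `quantize` and `bigram_model` are assumptions made for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame feature to the index of its nearest codeword."""
    # Squared Euclidean distance between every frame and every codeword.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def bigram_model(words, vocab_size, alpha=1.0):
    """Additively smoothed transition probabilities P(next | prev).

    Stand-in for the Sequence Memoizer (assumption): only the previous
    word is used as context, not an unbounded history.
    """
    counts = np.full((vocab_size, vocab_size), alpha)
    for prev, nxt in zip(words[:-1], words[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 6 frames of 2-D features, 3-word codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, 0.0], [0.9, 1.1], [0.0, 0.2],
                     [1.2, 0.9], [4.8, 5.1], [5.2, 4.9]])

words = quantize(features, codebook)                   # visual-word sequence
probs = bigram_model(words, vocab_size=len(codebook))  # unsupervised temporal model
```

In practice the codebook would itself be learnt from the video features (for example by clustering), and a model with longer context would be needed to capture the long-range dependencies that motivate the use of the Sequence Memoizer.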