Deep convolutional neural network (CNN) models have achieved remarkable success in semantic image segmentation. However, most segmentation models are built on classification networks, which tend to learn image-level features and lose abundant spatial information through repeated pooling and downsampling operations. CNN-based methods are also not robust to variations in their inputs. Hence, directly applying existing segmentation methods to semantic video segmentation yields predictions that are spatially discontinuous within one instance and temporally inconsistent for the same objects across adjacent frames. To tackle this challenge, we propose an Attention-Guided Network (AGNet) that adaptively strengthens inter-frame and intra-frame features for more precise segmentation predictions. Specifically, we append an adjacent attention module (AAM) and a spatial attention module (SAM) on top of a dilated fully convolutional network (FCN); these modules model feature correlations in the temporal and spatial dimensions, respectively. The AAM selectively enhances the inter-frame features of the same objects across adjacent frames for temporally consistent predictions. Meanwhile, the SAM selectively aggregates the intra-frame features within one instance for spatially continuous predictions. Finally, we sum the outputs of the two attention modules to further improve the feature representations, which contributes to more precise segmentation predictions across the temporal and spatial dimensions simultaneously. Extensive experiments demonstrate the effectiveness of the proposed method, which obtains a state-of-the-art mean intersection over union (mIoU) of 75.22% on the CamVid dataset.
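
As an illustration only, the two-branch design described above (attention within the current frame, attention from the current frame to an adjacent frame, and summation of the two outputs) can be sketched in a few lines. The abstract does not specify the internals of the AAM or SAM, so this sketch assumes plain dot-product attention with softmax weights, a residual connection, and no learned projections. The names `attend` and `fuse` are hypothetical, and each frame is represented as a flat list of per-position feature vectors rather than a feature map from a dilated FCN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, sources):
    """Add to each query a softmax-weighted sum of the source vectors.

    Assumption: similarity is a raw dot product and the result is added
    back to the query (residual connection). The paper may differ.
    """
    out = []
    for q in queries:
        weights = softmax([dot(q, s) for s in sources])
        context = [
            sum(w * s[c] for w, s in zip(weights, sources))
            for c in range(len(q))
        ]
        out.append([qc + cc for qc, cc in zip(q, context)])
    return out


def fuse(current, adjacent):
    """Sum a spatial branch and an adjacent-frame branch element-wise."""
    spatial = attend(current, current)    # intra-frame (SAM-like, assumed)
    temporal = attend(current, adjacent)  # inter-frame (AAM-like, assumed)
    return [
        [a + b for a, b in zip(row_s, row_t)]
        for row_s, row_t in zip(spatial, temporal)
    ]


if __name__ == "__main__":
    current_frame = [[1.0, 0.0], [0.0, 1.0]]   # two positions, two channels
    adjacent_frame = [[0.9, 0.1], [0.1, 0.9]]
    print(fuse(current_frame, adjacent_frame))
```

In this sketch the only difference between the two branches is where the source vectors come from: the same frame for the spatial branch and the neighbouring frame for the temporal branch. A practical implementation would operate on tensors and would likely include learned query, key, and value projections.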