Figure/ground segmentation in video has attracted growing interest because of its importance. The key to such segmentation is the modeling of spatio-temporal coherence. Previous works usually measure this coherence with approximate motion estimates, which results in low accuracy. In this paper, we present a novel method for measuring the coherence and propose an algorithm for target segmentation and tracking. Each image is abstracted into compact, perceptually homogeneous elements. By representing these elements as sparse linear combinations of dictionary templates, the algorithm capitalizes on the inherent low-rank structure of representations that are learned jointly. The coefficients of the constrained representation serve as the measure of spatio-temporal coherence. Finally, a simple energy minimization with an online parameter-updating scheme is adopted in the segmentation stage, yielding a binary segmentation of the object. In addition, an adaptive dictionary is proposed to enhance the system's robustness against occlusion. Our approach outperforms state-of-the-art methods in object segmentation accuracy.
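
For orientation, a jointly learned sparse and low-rank representation of this kind is often written as the optimization problem below. This is only an illustrative sketch of a common formulation, not necessarily the exact objective used in this paper. All symbols are assumptions introduced here: the columns of $X$ are the feature vectors of the image elements, the columns of $D$ are the dictionary templates, $Z$ holds the representation coefficients, $E$ absorbs the reconstruction error, and $\lambda_1, \lambda_2 > 0$ are trade-off weights.

```latex
% Illustrative (assumed) objective: low-rank + sparse joint representation.
% X: element features, D: dictionary templates, Z: coefficients, E: error.
\begin{equation*}
  \min_{Z,\,E} \; \|Z\|_{*} + \lambda_1 \|Z\|_{1} + \lambda_2 \|E\|_{1}
  \quad \text{subject to} \quad X = DZ + E
\end{equation*}
```

Here the nuclear norm $\|Z\|_{*}$ encourages the jointly learned coefficients to be low-rank, the entrywise $\ell_1$ norm $\|Z\|_{1}$ encourages each element to use only a few templates, and the $\ell_1$ penalty on $E$ is one common way to tolerate sparse corruption such as partial occlusion.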