Video object segmentation (VOS) plays an important role in video analysis and understanding, and in turn supports diverse applications, including video editing, video rendering, and augmented and virtual reality. However, existing deep learning-based approaches rely heavily on large numbers of pixel-wise annotated video frames to achieve promising results, and producing such annotations is notoriously laborious and costly. To address this, we formulate unsupervised video object detection by exploring simulated dense labels and explicit motion clues. Specifically, we first propose a video label generator network that produces dense labels from the sparsely annotated frames and the flow motion between them, which largely alleviates the limitations imposed by sparse labels. Furthermore, we propose a transformer-based architecture that models appearance and motion clues simultaneously through a cross-attention module, in order to better handle non-linear motion with potential occlusions. Extensive experiments show that the proposed method outperforms recent VOS methods on four popular benchmarks (DAVIS-16, FBMS, YouTube-VOS, and SegTrack-v2). Moreover, the proposed method can be applied to a wide range of in-the-wild scenes, such as forests and wild animals. Given its effectiveness and generalization ability, we believe that our method could serve as a useful basis for alleviating the dependence on dense annotation in video data.