Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

CogVideoX: Text-to-video diffusion models with an expert transformer

Z Yang, J Teng, W Zheng, M Ding, S Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

X Chi, Y Wang, A Cheng, P Fang, Z Tian, Y He… - arXiv preprint arXiv …, 2024 - arxiv.org
Massive multi-modality datasets play a significant role in facilitating the success of large
video-language models. However, current video-language datasets primarily provide text …

Image Conductor: Precision Control for Interactive Video Synthesis

Y Li, X Wang, Z Zhang, Z Wang, Z Yuan, L Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Filmmaking and animation production often require sophisticated techniques for
coordinating camera transitions and object movements, typically involving labor-intensive …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

MotionBooth: Motion-Aware Customized Text-to-Video Generation

J Wu, X Li, Y Zeng, J Zhang, Q Zhou, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we present MotionBooth, an innovative framework designed for animating
customized subjects with precise control over both object and camera movements. By …

From Sora What We Can See: A Survey of Text-to-Video Generation

R Sun, Y Zhang, T Shah, J Sun, S Zhang, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
With impressive achievements made, artificial intelligence is on the path forward to artificial
general intelligence. Sora, developed by OpenAI, which is capable of minute-level world …