Panda-70M: Captioning 70M videos with multiple cross-modality teachers

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org

General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Cited by 8 · Related articles · All 3 versions

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 31 · Related articles · All 3 versions

CogVideoX: Text-to-video diffusion models with an expert transformer

Z Yang, J Teng, W Zheng, M Ding, S Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …

Cited by 3 · Related articles · All 2 versions

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

Cited by 8 · Related articles · All 2 versions

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

Cited by 3 · Related articles · All 2 versions

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

X Chi, Y Wang, A Cheng, P Fang, Z Tian, Y He… - arXiv preprint arXiv …, 2024 - arxiv.org

Massive multi-modality datasets play a significant role in facilitating the success of large
video-language models. However, current video-language datasets primarily provide text …

Cited by 1 · Related articles · All 2 versions

Image Conductor: Precision Control for Interactive Video Synthesis

Y Li, X Wang, Z Zhang, Z Wang, Z Yuan, L Xie… - arXiv preprint arXiv …, 2024 - arxiv.org

Filmmaking and animation production often require sophisticated techniques for
coordinating camera transitions and object movements, typically involving labor-intensive …

Cited by 2 · Related articles · All 2 versions

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Cited by 5 · Related articles · All 2 versions

MotionBooth: Motion-Aware Customized Text-to-Video Generation

J Wu, X Li, Y Zeng, J Zhang, Q Zhou, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we present MotionBooth, an innovative framework designed for animating
customized subjects with precise control over both object and camera movements. By …

Cited by 1 · Related articles · All 2 versions

From Sora What We Can See: A Survey of Text-to-Video Generation

R Sun, Y Zhang, T Shah, J Sun, S Zhang, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org

With impressive achievements made, artificial intelligence is on the path forward to artificial
general intelligence. Sora, developed by OpenAI, which is capable of minute-level world …
