xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

MS Ryoo, H Zhou, S Kendre, C Qin, L Xue… - arXiv preprint arXiv …, 2024 - arxiv.org
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos,
particularly designed to efficiently capture temporal information over multiple frames. BLIP-3 …

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

J Li, Q Long, J Zheng, X Gao, R Piramuthu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the
post-training phase by distilling a highly capable consistency model from a pretrained T2V …

Progressive Autoregressive Video Diffusion Models

D Xie, Z Xu, Y Hong, H Tan, D Liu, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Current frontier video diffusion models have demonstrated remarkable results at generating
high-quality videos. However, they can only generate short video clips, normally around 10 …

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

J Wang, C Wang, K Huang, J Huang, L Jin - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in
numerous applications. However, the emphasis on brief summary texts during pre-training …

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Z Wu, A Siarohin, W Menapace, I Skorokhodov… - arXiv preprint arXiv …, 2024 - arxiv.org
Real-world videos consist of sequences of events. Generating such sequences with precise
temporal control is infeasible with existing video generators that rely on a single paragraph …

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Q Wang, Y Shi, J Ou, R Chen, K Lin, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
As visual generation technologies continue to advance, the scale of video datasets has
expanded rapidly, and the quality of these datasets is critical to the performance of video …

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

C Yang, X Dong, X Zhu, W Su, J Wang, H Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

J Kim, H Kim, H Lee, YM Ro - arXiv preprint arXiv:2411.16173, 2024 - arxiv.org
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video
content remains challenging due to limitations in context length and substantial memory …

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

X Ju, J Zhuang, Z Zhang, Y Bian, Q Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the technical report for the winning solution of the CVPR2024 GenAI Media
Generation Challenge Workshop's Instruction-guided Image Editing track. Instruction-guided …

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

X Li, Y Wang, J Yu, X Zeng, Y Zhu, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form content with implicit memorization. Despite …