xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

MS Ryoo, H Zhou, S Kendre, C Qin, L Xue… - arXiv preprint arXiv …, 2024 - arxiv.org
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos,
particularly designed to efficiently capture temporal information over multiple frames. BLIP-3 …

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

J Li, Q Long, J Zheng, X Gao, R Piramuthu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the
post-training phase by distilling a highly capable consistency model from a pretrained T2V …

Progressive Autoregressive Video Diffusion Models

D Xie, Z Xu, Y Hong, H Tan, D Liu, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Current frontier video diffusion models have demonstrated remarkable results at generating
high-quality videos. However, they can only generate short video clips, normally around 10 …

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

J Wang, C Wang, K Huang, J Huang, L Jin - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in
numerous applications. However, the emphasis on brief summary texts during pre-training …

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Z Wu, A Siarohin, W Menapace, I Skorokhodov… - arXiv preprint arXiv …, 2024 - arxiv.org
Real-world videos consist of sequences of events. Generating such sequences with precise
temporal control is infeasible with existing video generators that rely on a single paragraph …

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Q Wang, Y Shi, J Ou, R Chen, K Lin, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
As visual generation technologies continue to advance, the scale of video datasets has
expanded rapidly, and the quality of these datasets is critical to the performance of video …

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

C Yang, X Dong, X Zhu, W Su, J Wang, H Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

J Kim, H Kim, H Lee, YM Ro - arXiv preprint arXiv:2411.16173, 2024 - arxiv.org
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video
content remains challenging due to limitations in context length and substantial memory …

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

X Ju, J Zhuang, Z Zhang, Y Bian, Q Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the technical report for the winning solution of the CVPR2024 GenAI Media
Generation Challenge Workshop's Instruction-guided Image Editing track. Instruction-guided …

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

X Li, Y Wang, J Yu, X Zeng, Y Zhu, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form content with implicit memorization. Despite …