Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Video-Bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

M Ning, B Zhu, Y Xie, B Lin, J Cui, L Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-based large language models (Video-LLMs) have been recently introduced, targeting
both fundamental improvements in perception and comprehension, and a diverse range of …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length videos with a constant …

LLMBind: A unified modality-task integration framework

B Zhu, P Jin, M Ning, B Lin, J Huang, Q Song… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent multimodal large language models tackle various modality tasks, they
possess limited integration capabilities for complex multi-modality tasks, consequently …

ST-LLM: Large Language Models Are Effective Temporal Learners

R Liu, C Li, H Tang, Y Ge, Y Shan, G Li - arXiv preprint arXiv:2404.00308, 2024 - arxiv.org
Large Language Models (LLMs) have showcased impressive capabilities in text
comprehension and generation, prompting research efforts towards video LLMs to facilitate …

Long Context Transfer from Language to Vision

P Zhang, K Zhang, B Li, G Zeng, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video sequences offer valuable temporal information, but existing large multimodal models
(LMMs) fall short in understanding extremely long videos. Many works address this by …