Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Video-Bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

M Ning, B Zhu, Y Xie, B Lin, J Cui, L Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-based large language models (Video-LLMs) have been recently introduced, targeting
both fundamental improvements in perception and comprehension, and a diverse range of …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length videos with a constant …

LLMBind: A unified modality-task integration framework

B Zhu, P Jin, M Ning, B Lin, J Huang, Q Song… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent multimodal large language models tackle various modality tasks, they
possess limited integration capabilities for complex multi-modality tasks, consequently …

ST-LLM: Large Language Models Are Effective Temporal Learners

R Liu, C Li, H Tang, Y Ge, Y Shan, G Li - arXiv preprint arXiv:2404.00308, 2024 - arxiv.org
Large Language Models (LLMs) have showcased impressive capabilities in text
comprehension and generation, prompting research efforts towards video LLMs to facilitate …

Long Context Transfer from Language to Vision

P Zhang, K Zhang, B Li, G Zeng, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video sequences offer valuable temporal information, but existing large multimodal models
(LMMs) fall short in understanding extremely long videos. Many works address this by …