Foundational models defining a new era in vision: A survey and outlook

M Awais, M Naseer, S Khan, RM Anwer… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision systems to see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects

MU Hadi, R Qureshi, A Shah, M Irfan, A Zafar… - Authorea …, 2023 - techrxiv.org
Within the vast expanse of computerized language processing, a revolutionary entity known
as Large Language Models (LLMs) has emerged, wielding immense power in its capacity to …

Video-llama: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

Seed-bench: Benchmarking multimodal llms with generative comprehension

B Li, R Wang, G Wang, Y Ge, Y Ge, Y Shan - arXiv preprint arXiv …, 2023 - arxiv.org
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large
Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

Mvbench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Video-llava: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs), building upon the foundation of powerful large
language models (LLMs), have recently demonstrated exceptional capabilities in generating …

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Z Xu, Y Zhang, E Xie, Z Zhao, Y Guo, KKY Wong… - arXiv preprint arXiv …, 2023 - arxiv.org
In the past decade, autonomous driving has experienced rapid development in both
academia and industry. However, its limited interpretability remains a significant unsolved …