Related articles - Academic resource search

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 29 · Related articles · All 2 versions

[PDF] thecvf.com

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Cited by 103 · Related articles · All 2 versions

[PDF] thecvf.com

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Z Tang, Z Yang, M Khademi, Y Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context
interleaved multimodal representations. By aligning modalities with language for …

Cited by 11 · Related articles · All 2 versions

[PDF] arxiv.org

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Cited by 7 · Related articles · All 3 versions

[PDF] thecvf.com

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Cited by 21 · Related articles · All 6 versions

[PDF] thecvf.com

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …

Cited by 56 · Related articles · All 2 versions

[PDF] openreview.net

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Cited by 295 · Related articles · All 4 versions

[PDF] thecvf.com

Violin: A large-scale dataset for video-and-language inference

J Liu, W Chen, Y Cheng, Z Gan, L Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com

We introduce a new task, Video-and-Language Inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as premise, paired …

Cited by 70 · Related articles · All 8 versions

[PDF] thecvf.com

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com

As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Cited by 208 · Related articles · All 11 versions

[PDF] arxiv.org

PandaGPT: One model to instruction-follow them all

Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org

We present PandaGPT, an approach to emPower large lANguage moDels with visual and
Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …

Cited by 156 · Related articles · All 3 versions
