MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Gemini: a family of highly capable multimodal models

G Team, R Anil, S Borgeaud, Y Wu, JB Alayrac… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Hypergraph attention networks for multimodal learning

ES Kim, WY Kang, KW On, YJ Heo… - Proceedings of the …, 2020 - openaccess.thecvf.com
One of the fundamental problems that arise in multimodal learning tasks is the disparity of
information levels between different modalities. To resolve this problem, we propose …

i-Code: An integrative and composable multimodal learning framework

Z Yang, Y Fang, C Zhu, R Pryzant, D Chen… - Proceedings of the …, 2023 - ojs.aaai.org
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to
maintain a holistic worldview. Most current pretraining methods, however, are limited to one …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning

W Li, C Gao, G Niu, X Xiao, H Liu, J Liu, H Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and
cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or …

Dynamic multimodal fusion

Z Xue, R Marculescu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Deep multimodal learning has achieved great progress in recent years. However, current
fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with …

UIBert: Learning generic multimodal representations for UI understanding

C Bai, X Zang, Y Xu, S Sunkara, A Rastogi… - arXiv preprint arXiv …, 2021 - arxiv.org
To improve the accessibility of smart devices and to simplify their usage, building models
which understand user interfaces (UIs) and assist users to complete their tasks is critical …