- Academic resource search

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Cited by 108 · Related articles · All 6 versions

[PDF] arxiv.org

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org

Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

Cited by 128 · Related articles · All 3 versions

[PDF] thecvf.com

SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) building upon the foundation of powerful large
language models (LLMs) have recently demonstrated exceptional capabilities in generating …

Cited by 29 · Related articles · All 3 versions

[PDF] thecvf.com

Capsfusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

Cited by 18 · Related articles · All 3 versions

[PDF] aclanthology.org

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org

Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

Cited by 43 · Related articles · All 4 versions

[PDF] thecvf.com

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com

The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Cited by 14 · Related articles · All 4 versions

[PDF] thecvf.com

Instruct-Imagen: Image generation with multi-modal instruction

H Hu, KCK Chan, YC Su, W Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com

This paper presents Instruct-Imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction …

Cited by 9 · Related articles · All 4 versions

[PDF] arxiv.org

How many unicorns are in this image? a safety evaluation benchmark for vision llms

H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org

This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different
from prior studies, we shift our focus from evaluating standard performance to introducing a …

Cited by 29 · Related articles · All 2 versions

[PDF] thecvf.com

Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

H Du, S Zhang, B Xie, G Nan, J Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Video anomaly understanding (VAU) aims to automatically comprehend unusual
occurrences in videos, thereby enabling various applications such as traffic surveillance and …

Cited by 3 · Related articles · All 3 versions

[PDF] arxiv.org

Llm evaluators recognize and favor their own generations

A Panickssery, SR Bowman, S Feng - arXiv preprint arXiv:2404.13076, 2024 - arxiv.org

Self-evaluation using large language models (LLMs) has proven valuable not only in
benchmarking but also methods like reward modeling, constitutional AI, and self-refinement …

Cited by 20 · Related articles · All 3 versions
