Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) building upon the foundation of powerful large
language models (LLMs) have recently demonstrated exceptional capabilities in generating …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Instruct-Imagen: Image generation with multi-modal instruction

H Hu, KCK Chan, YC Su, W Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper presents Instruct-Imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction …

How many unicorns are in this image? A safety evaluation benchmark for vision LLMs

H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different
from prior studies, we shift our focus from evaluating standard performance to introducing a …

Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

H Du, S Zhang, B Xie, G Nan, J Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video anomaly understanding (VAU) aims to automatically comprehend unusual
occurrences in videos, thereby enabling various applications such as traffic surveillance and …

LLM evaluators recognize and favor their own generations

A Panickssery, SR Bowman, S Feng - arXiv preprint arXiv:2404.13076, 2024 - arxiv.org
Self-evaluation using large language models (LLMs) has proven valuable not only in
benchmarking but also in methods like reward modeling, constitutional AI, and self-refinement …