Holistic Evaluation for Interleaved Text-and-Image Generation

M Liu, Z Xu, Z Lin, T Ashby, J Rimchala, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Interleaved text-and-image generation has been an intriguing research direction, where the
models are required to generate both images and text pieces in an arbitrary order. Despite …

KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

E Yiu, M Qraitem, C Wong, AN Majhi, Y Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates visual analogical reasoning in large multimodal models (LMMs)
compared to human adults and children. A "visual analogy" is an abstract rule inferred from …

Unveiling Encoder-Free Vision-Language Models

H Diao, Y Cui, X Li, Y Wang, H Lu, X Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual
features followed by large language models (LLMs) for visual-language tasks. However, the …

CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

J Kil, Z Mai, J Lee, Z Wang, K Cheng, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to compare objects, scenes, or situations is crucial for effective decision-making
and problem-solving in everyday life. For instance, comparing the freshness of apples …

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

T Zhao, Q Zhang, K Lee, P Liu, L Zhang, C Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmChat, a model designed to excel in handling long contexts and video
understanding tasks. OmChat's new architecture standardizes how different visual inputs are …

GenAI Arena: An Open Evaluation Platform for Generative Models

D Jiang, M Ku, T Li, Y Ni, S Sun, R Fan… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative AI has made remarkable strides in revolutionizing fields such as image and video
generation. These advancements are driven by innovative algorithms, architecture, and …

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

M Dogan, I Kesen, I Calixto, A Erdem… - arXiv preprint arXiv …, 2024 - arxiv.org
The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for
their effective application across diverse tasks. This study aims to evaluate the performance …

MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

Z Xu, M Liu, Y Shen, J Rimchala, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Vision-Language Models (VLMs) have led to the development of
Vision-Language Generalists (VLGs) capable of understanding and generating interleaved …

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

H Liu, X Zhang, H Xu, Y Shi, C Jiang, M Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have
recently achieved remarkable performance on various vision-language tasks across multiple …