Holistic Evaluation for Interleaved Text-and-Image Generation

M Liu, Z Xu, Z Lin, T Ashby, J Rimchala, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Interleaved text-and-image generation has been an intriguing research direction, where the
models are required to generate both images and text pieces in an arbitrary order. Despite …

KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

E Yiu, M Qraitem, C Wong, AN Majhi, Y Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates visual analogical reasoning in large multimodal models (LMMs)
compared to human adults and children. A "visual analogy" is an abstract rule inferred from …

Unveiling Encoder-Free Vision-Language Models

H Diao, Y Cui, X Li, Y Wang, H Lu, X Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual
features followed by large language models (LLMs) for visual-language tasks. However, the …

CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

J Kil, Z Mai, J Lee, Z Wang, K Cheng, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to compare objects, scenes, or situations is crucial for effective decision-making
and problem-solving in everyday life. For instance, comparing the freshness of apples …

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

T Zhao, Q Zhang, K Lee, P Liu, L Zhang, C Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmChat, a model designed to excel in handling long contexts and video
understanding tasks. OmChat's new architecture standardizes how different visual inputs are …

GenAI Arena: An Open Evaluation Platform for Generative Models

D Jiang, M Ku, T Li, Y Ni, S Sun, R Fan… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative AI has made remarkable strides in revolutionizing fields such as image and video
generation. These advancements are driven by innovative algorithms, architecture, and …

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

M Dogan, I Kesen, I Calixto, A Erdem… - arXiv preprint arXiv …, 2024 - arxiv.org
The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for
their effective application across diverse tasks. This study aims to evaluate the performance …

MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

Z Xu, M Liu, Y Shen, J Rimchala, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Vision-Language Models (VLMs) have led to the development of
Vision-Language Generalists (VLGs) capable of understanding and generating interleaved …

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

H Liu, X Zhang, H Xu, Y Shi, C Jiang, M Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have
recently achieved remarkable performance on various vision-language tasks across multiple …