OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

DenseFusion-1M: Merging vision experts for comprehensive multimodal perception

X Li, F Zhang, H Diao, Y Wang, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex
understanding of various visual elements, including multiple objects, text information, and …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

MMComposition: Revisiting the compositionality of pre-trained vision-language models

H Hua, Y Tang, Z Zeng, L Cao, Z Yang, H He… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal
understanding, enabling more sophisticated and accurate integration of visual and textual …

MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks

J Chen, T Liang, S Siu, Z Wang, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500
real-world tasks, to address the highly heterogeneous daily use cases of end users. Our …

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

YG Hsieh, CY Hsieh, SY Yeh, L Béthune… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans describe complex scenes with compositionality, using simple text descriptions
enriched with links and relationships. While vision-language research has aimed to develop …

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Y Tang, J Guo, H Hua, S Liang, M Feng, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant
progress in multimodal understanding, expanding their capacity to analyze video content …

MATE: Meet At The Embedding - Connecting Images with Long Texts

YK Jang, J Kang, YJ Lee, D Kim - arXiv preprint arXiv:2407.09541, 2024 - arxiv.org
While advancements in Vision Language Models (VLMs) have significantly improved the
alignment of visual and textual data, these models primarily focus on aligning images with …

TULIP: Token-length upgraded CLIP

I Najdenkoska, MM Derakhshani, YM Asano… - arXiv preprint arXiv …, 2024 - arxiv.org
We address the challenge of representing long captions in vision-language models, such as
CLIP. By design, these models are limited by fixed, absolute positional encodings, restricting …

CAST: Cross-modal Alignment Similarity Test for Vision Language Models

G Dagan, O Loginova, A Batra - arXiv preprint arXiv:2409.11007, 2024 - arxiv.org
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering
(VQA) tasks which assess a model's understanding of scenes. Good VQA performance is …