Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels

H Wu, Z Zhang, W Zhang, C Chen, L Liao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
The explosion of visual content available online underscores the requirement for an
accurate machine assessor to robustly evaluate scores across diverse types of visual …

Towards open-ended visual quality comparison

H Wu, H Zhu, Z Zhang, E Zhang, C Chen, L Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
Comparative settings (e.g., pairwise choice, listwise ranking) have been adopted by a wide
range of subjective studies for image quality assessment (IQA), as they inherently standardize …

AIGC-VQA: A holistic perception metric for AIGC video quality assessment

Y Lu, X Li, B Li, Z Yu, F Guan, X Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the development of generative models such as the diffusion model and the auto-regressive
model, AI-generated content (AIGC) is experiencing explosive growth. Moreover, existing …

Quality Assessment in the Era of Large Models: A Survey

Z Zhang, Y Zhou, C Li, B Zhao, X Liu, G Zhai - arXiv preprint arXiv …, 2024 - arxiv.org
Quality assessment, which evaluates the visual quality level of multimedia experiences, has
garnered significant attention from researchers and has evolved substantially through …

AesBench: An expert benchmark for multimodal large language models on image aesthetics perception

Y Huang, Q Yuan, X Sheng, Z Yang, H Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
With collective endeavors, multimodal large language models (MLLMs) are undergoing
flourishing development. However, their performance on image aesthetics perception …

A comprehensive study of multimodal large language models for image quality assessment

T Wu, K Ma, J Liang, Y Yang, L Zhang - arXiv preprint arXiv:2403.10854, 2024 - arxiv.org
While Multimodal Large Language Models (MLLMs) have experienced significant
advancement in visual understanding and reasoning, their potential to serve as powerful …

VersaT2I: Improving text-to-image models with versatile reward

J Guo, W Chai, J Deng, HW Huang, T Ye, Y Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent text-to-image (T2I) models have benefited from large-scale and high-quality data,
demonstrating impressive performance. However, these T2I models still struggle to produce …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

Q-Boost: On visual quality assessment ability of low-level multi-modality foundation models

Z Zhang, H Wu, Z Ji, C Li, E Zhang… - … on Multimedia and …, 2024 - ieeexplore.ieee.org
Recent advancements in Multi-modality Large Language Models (MLLMs) have
demonstrated remarkable capabilities in complex high-level vision tasks. However, the …