GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation

T Wu, G Yang, Z Li, K Zhang, Z Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite recent advances in text-to-3D generative methods, there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as …

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arXiv preprint arXiv:2401.01614, 2024 - arxiv.org
The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation

A Yan, Z Yang, W Zhu, K Lin, L Li, J Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user
interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as …

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

GPT-4V(ision) as a social media analysis engine

H Lyu, J Huang, D Zhang, Y Yu, X Mou, J Pan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent research has offered insights into the extraordinary capabilities of Large Multimodal
Models (LMMs) in various general vision and language tasks. There is growing interest in …

UFO: A UI-focused agent for Windows OS interaction

C Zhang, L Li, S He, X Zhang, B Qiao, S Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to
applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a …

VIEScore: Towards explainable metrics for conditional image synthesis evaluation

M Ku, D Jiang, C Wei, X Yue, W Chen - arXiv preprint arXiv:2312.14867, 2023 - arxiv.org
In the rapidly advancing field of conditional image generation research, challenges such as
limited explainability lie in effectively evaluating the performance and capabilities of various …

MLLM-Bench, evaluating multi-modal LLMs using GPT-4V

W Ge, S Chen, G Chen, J Chen, Z Chen, S Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
In the pursuit of Artificial General Intelligence (AGI), the integration of vision in language
models has marked a significant milestone. The advent of vision-language models (MLLMs) …

DreamBench++: A human-aligned benchmark for personalized image generation

Y Peng, Y Cui, H Tang, Z Qi, R Dong, J Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
Personalized image generation holds great promise in assisting humans in everyday work
and life due to its impressive function in creatively generating personalized content …

Finetuned multimodal language models are high-quality image-text data filters

W Wang, K Mrini, L Yang, S Kumar, Y Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel framework for filtering image-text data by leveraging fine-tuned
Multimodal Language Models (MLMs). Our approach outperforms predominant filtering …