CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

WebArena: A realistic web environment for building autonomous agents

S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar… - arXiv preprint arXiv …, 2023 - arxiv.org
With generative AI advances, the exciting potential for autonomous agents to manage daily
tasks via natural language commands has emerged. However, current agents are primarily …

Image captioners are scalable vision learners too

M Tschannen, M Kumar, A Steiner… - Advances in …, 2024 - proceedings.neurips.cc
Contrastive pretraining on image-text pairs from the web is one of the most popular large-
scale pretraining strategies for vision backbones, especially in the context of large …

From pixels to UI actions: Learning to follow instructions via graphical user interfaces

P Shaw, M Joshi, J Cohan, J Berant… - Advances in …, 2023 - proceedings.neurips.cc
Much of the previous work towards digital agents for graphical user interfaces (GUIs) has
relied on text-based representations (derived from HTML or other structured data sources) …

mPLUG-DocOwl: Modularized multimodal large language model for document understanding

J Ye, A Hu, H Xu, Q Ye, M Yan, Y Dan, C Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Document understanding refers to automatically extracting, analyzing, and comprehending
information from various types of digital documents, such as web pages. Existing Multi …

PaLI-3 vision language models: Smaller, faster, stronger

X Chen, X Wang, L Beyer, A Kolesnikov, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that
compares favorably to similar models that are 10x larger. As part of arriving at this strong …