Pix2struct: Screenshot parsing as pretraining for visual language understanding

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org

Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Cited by 220 Related articles All 3 versions

[PDF] thecvf.com

Cogagent: A visual language model for gui agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Cited by 120 Related articles All 3 versions

[PDF] thecvf.com

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Abstract Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Cited by 105 Related articles All 3 versions

[PDF] arxiv.org

Webarena: A realistic web environment for building autonomous agents

S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar… - arXiv preprint arXiv …, 2023 - arxiv.org

With advances in generative AI, there is now potential for autonomous agents to manage
daily tasks via natural language commands. However, current agents are primarily created …

Cited by 158 Related articles All 4 versions

[PDF] arxiv.org

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org

Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

Cited by 175 Related articles All 3 versions

[PDF] arxiv.org

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org

Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Cited by 27 Related articles All 2 versions

[PDF] arxiv.org

Pali-x: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

Cited by 124 Related articles All 4 versions

[PDF] neurips.cc

Image captioners are scalable vision learners too

M Tschannen, M Kumar, A Steiner… - Advances in …, 2024 - proceedings.neurips.cc

Contrastive pretraining on image-text pairs from the web is one of the most popular large-
scale pretraining strategies for vision backbones, especially in the context of large …

Cited by 39 Related articles All 5 versions

[PDF] neurips.cc

From pixels to ui actions: Learning to follow instructions via graphical user interfaces

P Shaw, M Joshi, J Cohan, J Berant… - Advances in …, 2023 - proceedings.neurips.cc

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has
relied on text-based representations (derived from HTML or other structured data sources) …

Cited by 43 Related articles All 5 versions

[PDF] arxiv.org

mplug-docowl: Modularized multimodal large language model for document understanding

J Ye, A Hu, H Xu, Q Ye, M Yan, Y Dan, C Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Document understanding refers to automatically extracting, analyzing, and comprehending
information from various types of digital documents, such as a web page. Existing Multi …

Cited by 76 Related articles All 3 versions
