CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

WebArena: A realistic web environment for building autonomous agents

S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar… - arXiv preprint arXiv …, 2023 - arxiv.org
With generative AI advances, the exciting potential for autonomous agents to manage daily
tasks via natural language commands has emerged. However, current agents are primarily …

Image captioners are scalable vision learners too

M Tschannen, M Kumar, A Steiner… - Advances in …, 2024 - proceedings.neurips.cc
Contrastive pretraining on image-text pairs from the web is one of the most popular large-
scale pretraining strategies for vision backbones, especially in the context of large …

From pixels to UI actions: Learning to follow instructions via graphical user interfaces

P Shaw, M Joshi, J Cohan, J Berant… - Advances in …, 2023 - proceedings.neurips.cc
Much of the previous work towards digital agents for graphical user interfaces (GUIs) has
relied on text-based representations (derived from HTML or other structured data sources) …

mPLUG-DocOwl: Modularized multimodal large language model for document understanding

J Ye, A Hu, H Xu, Q Ye, M Yan, Y Dan, C Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Document understanding refers to automatically extracting, analyzing, and comprehending
information from various types of digital documents, such as web pages. Existing Multi …

PaLI-3 vision language models: Smaller, faster, stronger

X Chen, X Wang, L Beyer, A Kolesnikov, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that
compares favorably to similar models that are 10x larger. As part of arriving at this strong …