Towards natural language interfaces for data visualization: A survey

L Shen, E Shen, Y Luo, X Yang, X Hu… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Utilizing Visualization-oriented Natural Language Interfaces (V-NLI) as a complementary
input modality to direct manipulation for visual analytics can provide an engaging user …

CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

BLIVA: A simple multimodal LLM for better handling of text-rich visual questions

W Hu, Y Xu, Y Li, W Li, Z Chen, Z Tu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Vision Language Models (VLMs), which extend Large Language Models (LLMs) by
incorporating visual understanding capability, have demonstrated significant advancements …

Unifying vision, text, and layout for universal document processing

Z Tang, Z Yang, G Wang, Y Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied task formats …

MM1: Methods, analysis & insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …