Vision-language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer …
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks …
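The encoder-plus-LM composition described above can be sketched minimally. This is an illustrative toy, not any specific model's implementation: the dimensions, the random-projection "encoder", and the linear `Projector` adapter are all assumptions standing in for a pretrained vision tower (such as CLIP) and a real language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
VISION_DIM = 512   # width of the vision encoder's output features
LM_DIM = 768       # width of the language model's token embeddings
N_PATCHES = 16     # number of image patch tokens the encoder emits

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained encoder (e.g., CLIP): image -> patch features."""
    # A fixed random projection replaces the real ViT for this sketch.
    W = rng.standard_normal((image.size, VISION_DIM)) / np.sqrt(image.size)
    feat = image.reshape(-1) @ W            # one global feature vector...
    return np.tile(feat, (N_PATCHES, 1))    # ...tiled as mock patch tokens

class Projector:
    """Linear adapter mapping vision features into the LM embedding space."""
    def __init__(self) -> None:
        self.W = rng.standard_normal((VISION_DIM, LM_DIM)) / np.sqrt(VISION_DIM)

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        return feats @ self.W

def build_lm_input(image: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected image tokens to the text-token embeddings."""
    image_tokens = Projector()(vision_encoder(image))
    return np.concatenate([image_tokens, text_embeds], axis=0)

image = rng.standard_normal((32, 32, 3))
text_embeds = rng.standard_normal((5, LM_DIM))   # 5 mock text-token embeddings
seq = build_lm_input(image, text_embeds)
print(seq.shape)  # (21, 768): 16 image tokens followed by 5 text tokens
```

The LM then attends over this concatenated sequence, which is the common pattern behind many recent VLMs: the vision tower is frozen or lightly tuned, and only a small adapter bridges the two embedding spaces.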
Image captioning has been shown to be an effective pretraining method, comparable to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining …
J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike …
The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language …