MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

MiniVLM: A smaller and faster vision-language model

J Wang, X Hu, P Zhang, X Li, L Wang, L Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent vision-language (VL) studies have shown remarkable progress by learning generic
representations from massive image-text pairs with transformer models and then fine-tuning …

RegionGPT: Towards region understanding vision language model

Q Guo, S De Mello, H Yin, W Byeon… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision language models (VLMs) have experienced rapid advancements through the
integration of large language models (LLMs) with image-text pairs, yet they struggle with …

LLaMA-Adapter V2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
How to efficiently transform large language models (LLMs) into instruction followers has
recently become a popular research direction, while training LLMs for multi-modal reasoning remains …

e-ViL: A dataset and benchmark for natural language explanations in vision-language tasks

M Kayser, OM Camburu, L Salewski… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recently, there has been an increasing number of efforts to introduce models capable of
generating natural language explanations (NLEs) for their predictions on vision-language …

Kosmos-G: Generating images in context with multimodal large language models

X Pan, L Dong, S Huang, Z Peng, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image (T2I) and vision-language-to-image (VL2I) generation have made
significant strides. However, generation from generalized vision …

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model

GH Chen, S Chen, R Zhang, J Chen, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Large Vision-Language Models (LVLMs) have enabled processing
of multimodal inputs in language models but require significant computational resources for …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions

D Zhu, J Chen, K Haydarov, X Shen, W Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Asking insightful questions is crucial for acquiring knowledge and expanding our
understanding of the world. However, the importance of questioning has been largely …