MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

MiniVLM: A smaller and faster vision-language model

J Wang, X Hu, P Zhang, X Li, L Wang, L Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent vision-language (VL) studies have shown remarkable progress by learning generic
representations from massive image-text pairs with transformer models and then fine-tuning …

RegionGPT: Towards region understanding vision language model

Q Guo, S De Mello, H Yin, W Byeon… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision language models (VLMs) have experienced rapid advancements through the
integration of large language models (LLMs) with image-text pairs, yet they struggle with …

LLaMA-Adapter V2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
How to efficiently transform large language models (LLMs) into instruction followers has
recently become a popular research direction, while training LLMs for multi-modal reasoning remains …

e-ViL: A dataset and benchmark for natural language explanations in vision-language tasks

M Kayser, OM Camburu, L Salewski… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recently, there has been an increasing number of efforts to introduce models capable of
generating natural language explanations (NLEs) for their predictions on vision-language …

Kosmos-G: Generating images in context with multimodal large language models

X Pan, L Dong, S Huang, Z Peng, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image (T2I) and vision-language-to-image (VL2I) generation have made
significant strides. However, generation from generalized vision …

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model

GH Chen, S Chen, R Zhang, J Chen, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Large Vision-Language Models (LVLMs) have enabled processing
of multimodal inputs in language models but require significant computational resources for …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions

D Zhu, J Chen, K Haydarov, X Shen, W Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Asking insightful questions is crucial for acquiring knowledge and expanding our
understanding of the world. However, the importance of questioning has been largely …