Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Y Zhang, S Qian, B Peng, S Liu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit
controllable text generation. Multi-modal LLMs empower multi-modality understanding with …
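
The snippet is cut off before any method details; as an illustration of the general idea of token-level prompt highlighting (steering generation by emphasizing user-selected prompt tokens inside attention), here is a minimal NumPy sketch. The function names, the `scale` parameter, and the logit-bias formulation are illustrative assumptions, not the paper's actual mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def highlighted_attention(scores, highlight_mask, scale=2.0):
    """Upweight attention toward user-highlighted prompt tokens.

    scores:         (num_queries, num_keys) pre-softmax attention logits
    highlight_mask: (num_keys,) bool, True for highlighted prompt tokens
    scale:          multiplicative emphasis applied inside the softmax
    """
    # Adding log(scale) to a key's logit multiplies its softmax weight
    # by `scale` before renormalization.
    biased = scores + np.where(highlight_mask, np.log(scale), 0.0)
    return softmax(biased, axis=-1)

# Toy example: 1 query over 4 prompt tokens, tokens 1-2 highlighted.
scores = np.array([[0.5, 0.1, 0.2, 0.4]])
mask = np.array([False, True, True, False])
print(highlighted_attention(scores, mask))
```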

Contrastive region guidance: Improving grounding in vision-language models without training

D Wan, J Cho, E Stengel-Eskin, M Bansal - arXiv preprint arXiv …, 2024 - arxiv.org
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …
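
The abstract suggests a training-free way of grounding a VLM in a highlighted region. One hedged reading, sketched below, is a classifier-free-guidance-style contrast between the model's next-token logits with the full image and with the candidate region masked out; `alpha` and the exact formula are assumptions for illustration, not the paper's stated method:

```python
import numpy as np

def contrastive_region_logits(logits_full, logits_masked, alpha=1.0):
    """Training-free contrastive guidance over next-token logits.

    logits_full:   logits when the model sees the full image
    logits_masked: logits when the candidate region is blacked out
    alpha:         guidance strength (alpha=0 recovers the plain model)

    Tokens whose probability drops when the region is hidden are the
    ones grounded in that region; the contrast amplifies them.
    """
    return logits_full + alpha * (logits_full - logits_masked)

full = np.array([2.0, 0.5, 0.1])
masked = np.array([1.0, 0.6, 0.1])
print(contrastive_region_logits(full, masked, alpha=1.5))
```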

Prefix-diffusion: A lightweight diffusion model for diverse image captioning

G Liu, Y Li, Z Fei, H Fu, X Luo, Y Guo - arXiv preprint arXiv:2309.04965, 2023 - arxiv.org
While impressive performance has been achieved in image captioning, the limited diversity
of the generated captions and the large parameter scale remain major barriers to the real …
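
To make the "lightweight diffusion for captioning" idea concrete, here is a deliberately toy sketch of the overall shape: a reverse-diffusion loop over caption embeddings conditioned on an image-derived prefix. Everything here (the stand-in denoiser, `W`, the step count) is hypothetical scaffolding, not the Prefix-Diffusion architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, prefix, W):
    """Stand-in denoiser: a real model would be a Transformer denoising
    caption embeddings while attending to the image-derived prefix;
    here we just pull x_t toward a linear map of the prefix."""
    target = W @ prefix
    return x_t + 0.2 * (target - x_t)

def generate(prefix, W, dim=8, steps=10):
    x = rng.normal(size=dim)          # start from Gaussian noise
    for _ in range(steps):
        x = denoise_step(x, prefix, W)
    return x                          # a real model decodes this to tokens

W = rng.normal(size=(8, 4))
image_prefix = rng.normal(size=4)     # e.g. mapped from CLIP image features
print(generate(image_prefix, W))
```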

Emergent Visual-Semantic Hierarchies in Image-Text Representations

M Alper, H Averbuch-Elor - arXiv preprint arXiv:2407.08521, 2024 - arxiv.org
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing
text and images in a shared semantic space, they do not explicitly model the hierarchical …
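
One way to probe such a hierarchy without training, sketched below under CLIP-like assumptions (a shared embedding space and a generic "root" text), is to test whether a more general caption's embedding lies along the direction from the root toward a more specific caption's embedding. The score function and names are illustrative, not the paper's protocol:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def hierarchy_score(general, specific, root):
    """Probe whether `general` sits between a root embedding and a more
    specific embedding, as one rough signature of a visual-semantic
    hierarchy. Returns a value in [-1, 1]; higher means the root->general
    direction agrees with the root->specific direction."""
    d_rs = unit(specific - root)      # root -> specific direction
    d_rg = unit(general - root)       # root -> general direction
    return float(d_rs @ d_rg)

rng = np.random.default_rng(1)
root = rng.normal(size=16)                        # e.g. embedding of ""
general = root + rng.normal(size=16) * 0.5        # "an animal"
specific = general + rng.normal(size=16) * 0.5    # "a tabby cat on a sofa"
print(hierarchy_score(general, specific, root))
```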

Tag‐inferring and tag‐guided Transformer for image captioning

Y Yi, Y Liang, D Kong, Z Tang, J Peng - IET Computer Vision, 2024 - Wiley Online Library
Image captioning is an important task for understanding images. Recently, many studies
have used tags to build alignments between image information and language information …
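
A minimal sketch of the tag-inferring/tag-guiding pattern the title describes: score a fixed tag vocabulary against pooled image features, then prepend the top tags to the visual tokens so the caption decoder can attend to explicit semantic anchors. The helper names and the dot-product tag scorer are assumptions, not the paper's exact heads:

```python
import numpy as np

def infer_tags(image_features, tag_embeddings, top_k=3):
    """Score a fixed tag vocabulary against pooled image features and
    keep the top-k tags (a stand-in for a learned tag-inference head)."""
    scores = tag_embeddings @ image_features
    return np.argsort(scores)[::-1][:top_k]

def guided_decoder_input(image_tokens, tag_ids, tag_embeddings):
    """Prepend inferred tag embeddings to the visual tokens so the
    caption decoder can attend to them alongside the image."""
    return np.concatenate([tag_embeddings[tag_ids], image_tokens], axis=0)

rng = np.random.default_rng(2)
img_feat = rng.normal(size=16)             # pooled image feature
tag_emb = rng.normal(size=(100, 16))       # vocabulary of 100 tags
tags = infer_tags(img_feat, tag_emb)
tokens = guided_decoder_input(rng.normal(size=(9, 16)), tags, tag_emb)
print(tags, tokens.shape)
```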