Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Y Zhang, S Qian, B Peng, S Liu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit
controllable text generation. Multi-modal LLMs empower multi-modality understanding with …
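
The snippet is cut off before any method details; as an illustration of the general idea of token-level prompt highlighting (steering generation by emphasizing user-selected prompt tokens inside attention), here is a minimal NumPy sketch. The function names, the `scale` parameter, and the logit-bias formulation are illustrative assumptions, not the paper's actual mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def highlighted_attention(scores, highlight_mask, scale=2.0):
    """Upweight attention toward user-highlighted prompt tokens.

    scores:         (num_queries, num_keys) pre-softmax attention logits
    highlight_mask: (num_keys,) bool, True for highlighted prompt tokens
    scale:          multiplicative emphasis applied inside the softmax
    """
    # Adding log(scale) to a key's logit multiplies its softmax weight
    # by `scale` before renormalization.
    biased = scores + np.where(highlight_mask, np.log(scale), 0.0)
    return softmax(biased, axis=-1)

# Toy example: 1 query over 4 prompt tokens, tokens 1-2 highlighted.
scores = np.array([[0.5, 0.1, 0.2, 0.4]])
mask = np.array([False, True, True, False])
print(highlighted_attention(scores, mask))
```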

Contrastive region guidance: Improving grounding in vision-language models without training

D Wan, J Cho, E Stengel-Eskin, M Bansal - arXiv preprint arXiv …, 2024 - arxiv.org
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …
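
The abstract suggests a training-free way of grounding a VLM in a highlighted region. One hedged reading, sketched below, is a classifier-free-guidance-style contrast between the model's next-token logits with the full image and with the candidate region masked out; `alpha` and the exact formula are assumptions for illustration, not the paper's stated method:

```python
import numpy as np

def contrastive_region_logits(logits_full, logits_masked, alpha=1.0):
    """Training-free contrastive guidance over next-token logits.

    logits_full:   logits when the model sees the full image
    logits_masked: logits when the candidate region is blacked out
    alpha:         guidance strength (alpha=0 recovers the plain model)

    Tokens whose probability drops when the region is hidden are the
    ones grounded in that region; the contrast amplifies them.
    """
    return logits_full + alpha * (logits_full - logits_masked)

full = np.array([2.0, 0.5, 0.1])
masked = np.array([1.0, 0.6, 0.1])
print(contrastive_region_logits(full, masked, alpha=1.5))
```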

Prefix-diffusion: A lightweight diffusion model for diverse image captioning

G Liu, Y Li, Z Fei, H Fu, X Luo, Y Guo - arXiv preprint arXiv:2309.04965, 2023 - arxiv.org
While impressive performance has been achieved in image captioning, the limited diversity
of the generated captions and the large parameter scale remain major barriers to the real …
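
To make the "lightweight diffusion for captioning" idea concrete, here is a deliberately toy sketch of the overall shape: a reverse-diffusion loop over caption embeddings conditioned on an image-derived prefix. Everything here (the stand-in denoiser, `W`, the step count) is hypothetical scaffolding, not the Prefix-Diffusion architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, prefix, W):
    """Stand-in denoiser: a real model would be a Transformer denoising
    caption embeddings while attending to the image-derived prefix;
    here we just pull x_t toward a linear map of the prefix."""
    target = W @ prefix
    return x_t + 0.2 * (target - x_t)

def generate(prefix, W, dim=8, steps=10):
    x = rng.normal(size=dim)          # start from Gaussian noise
    for _ in range(steps):
        x = denoise_step(x, prefix, W)
    return x                          # a real model decodes this to tokens

W = rng.normal(size=(8, 4))
image_prefix = rng.normal(size=4)     # e.g. mapped from CLIP image features
print(generate(image_prefix, W))
```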

Emergent Visual-Semantic Hierarchies in Image-Text Representations

M Alper, H Averbuch-Elor - arXiv preprint arXiv:2407.08521, 2024 - arxiv.org
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing
text and images in a shared semantic space, they do not explicitly model the hierarchical …
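
One way to probe such a hierarchy without training, sketched below under CLIP-like assumptions (a shared embedding space and a generic "root" text), is to test whether a more general caption's embedding lies along the direction from the root toward a more specific caption's embedding. The score function and names are illustrative, not the paper's protocol:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def hierarchy_score(general, specific, root):
    """Probe whether `general` sits between a root embedding and a more
    specific embedding, as one rough signature of a visual-semantic
    hierarchy. Returns a value in [-1, 1]; higher means the root->general
    direction agrees with the root->specific direction."""
    d_rs = unit(specific - root)      # root -> specific direction
    d_rg = unit(general - root)       # root -> general direction
    return float(d_rs @ d_rg)

rng = np.random.default_rng(1)
root = rng.normal(size=16)                        # e.g. embedding of ""
general = root + rng.normal(size=16) * 0.5        # "an animal"
specific = general + rng.normal(size=16) * 0.5    # "a tabby cat on a sofa"
print(hierarchy_score(general, specific, root))
```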

Tag‐inferring and tag‐guided Transformer for image captioning

Y Yi, Y Liang, D Kong, Z Tang, J Peng - IET Computer Vision, 2024 - Wiley Online Library
Image captioning is an important task for understanding images. Recently, many studies
have used tags to build alignments between image information and language information …
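
A minimal sketch of the tag-inferring/tag-guiding pattern the title describes: score a fixed tag vocabulary against pooled image features, then prepend the top tags to the visual tokens so the caption decoder can attend to explicit semantic anchors. The helper names and the dot-product tag scorer are assumptions, not the paper's exact heads:

```python
import numpy as np

def infer_tags(image_features, tag_embeddings, top_k=3):
    """Score a fixed tag vocabulary against pooled image features and
    keep the top-k tags (a stand-in for a learned tag-inference head)."""
    scores = tag_embeddings @ image_features
    return np.argsort(scores)[::-1][:top_k]

def guided_decoder_input(image_tokens, tag_ids, tag_embeddings):
    """Prepend inferred tag embeddings to the visual tokens so the
    caption decoder can attend to them alongside the image."""
    return np.concatenate([tag_embeddings[tag_ids], image_tokens], axis=0)

rng = np.random.default_rng(2)
img_feat = rng.normal(size=16)             # pooled image feature
tag_emb = rng.normal(size=(100, 16))       # vocabulary of 100 tags
tags = infer_tags(img_feat, tag_emb)
tokens = guided_decoder_input(rng.normal(size=(9, 16)), tags, tag_emb)
print(tags, tokens.shape)
```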