Language models for image captioning: The quirks and what works

J Devlin, H Cheng, H Fang, S Gupta, L Deng… - arXiv preprint arXiv …, 2015 - arxiv.org
… , and then a maximum entropy (ME) language model is used to arrange these words into a
… In this paper, we compare the merits of these different language modeling approaches for …

VisualGPT: Data-efficient adaptation of pretrained language models for image captioning

J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
… for image captioning by utilizing pretrained language models (… image captioning. To our
knowledge, this is the first work that … large pretrained language models for image captioning. …

An empirical study of language CNN for image captioning

J Gu, G Wang, J Cai, T Chen - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
… statistical language modeling tasks and shows competitive performance in image captioning
In this work, we present an image captioning model with language CNN to explore both …

Image captioning for effective use of language models in knowledge-based visual question answering

A Salaberria, G Azkune, OL de Lacalle, A Soroa… - Expert Systems with …, 2023 - Elsevier
image captioning as a way to verbalize the information in the image, where the captions
are … Once the captions are generated, all the inference in our method is done using text-only …

Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
image captioning, provide a more comprehensive study on the scaling behavior via altering
data and model … as predicting the next token with language modeling, as shown in Figure 4 …

Image captioning with deep bidirectional LSTMs

C Wang, H Yang, C Bartz, C Meinel - Proceedings of the 24th ACM …, 2016 - dl.acm.org
… for image captioning with multimodal neural language model… with structure-content neural
language model (SC-NLM). … which replaces feed-forward neural language model in [13]. …

Unified vision-language pre-training for image captioning and VQA

L Zhou, H Palangi, L Zhang, H Hu, J Corso… - Proceedings of the AAAI …, 2020 - ojs.aaai.org
… We observe that compared to the two cases where we do not use any pre-trained model
or use only the pre-trained language model (i.e., BERT), using VLP significantly speedups the …

ClipCap: CLIP prefix for image captioning

R Mokady, A Hertz, AH Bermano - arXiv preprint arXiv:2111.09734, 2021 - arxiv.org
… a language model to generate the image captions. The recently proposed CLIP model contains
rich … it best for vision-language perception. Our key idea is that together with a pre-trained …

Fusion models for improved image captioning

M Kalimuthu, A Mogadala, M Mosbach… - Pattern Recognition. ICPR …, 2021 - Springer
… Building on these developments, we propose to incorporate external language models
into visual captioning frameworks to aid and improve their capabilities both for description …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
… To address this challenge, we leverage existing captions and explore augmenting them with
… the original captions using a large language model (LLM), yielding comprehensive image