Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
In recent years, we have witnessed a significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …

SmallCap: Lightweight image captioning prompted with retrieval augmentation

R Ramos, B Martins, D Elliott… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in image captioning have focused on scaling the data and model size,
substantially increasing the cost of pre-training and fine-tuning. As an alternative to large …
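To make the retrieval-augmentation idea named in the title concrete, here is a minimal Python sketch of retrieval-augmented prompting in general: captions of the datastore images most similar to the input image are placed in the prompt of a small decoder. The datastore, embeddings, similarity measure, and prompt wording are illustrative assumptions, not SmallCap's actual implementation.

```python
# Sketch of retrieval-augmented caption prompting (illustrative assumptions only).
import numpy as np

# Hypothetical datastore: caption text paired with a precomputed image embedding.
datastore = [
    ("a dog runs across a grassy field", np.array([0.9, 0.1, 0.0])),
    ("two children play soccer in a park", np.array([0.7, 0.6, 0.1])),
    ("a plate of pasta on a wooden table", np.array([0.0, 0.2, 0.9])),
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(query_embedding: np.ndarray, k: int = 2) -> str:
    # Retrieve the k captions whose image embeddings are closest to the query image.
    ranked = sorted(datastore, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    retrieved = [caption for caption, _ in ranked[:k]]
    lines = "\n".join(f"- {c}" for c in retrieved)
    # The retrieved captions condition the decoder via the prompt, so only a small
    # number of new parameters would need training in a lightweight setup.
    return f"Similar images show:\n{lines}\nThis image shows:"

query = np.array([0.8, 0.3, 0.05])  # hypothetical embedding of the input image
print(build_prompt(query))
```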

Nocaps: Novel object captioning at scale

H Agrawal, K Desai, Y Wang, X Chen… - Proceedings of the …, 2019 - openaccess.thecvf.com
Image captioning models have achieved impressive results on datasets containing limited
visual concepts and large amounts of paired image-caption training data. However, if these …

VisualGPT: Data-efficient adaptation of pretrained language models for image captioning

J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The limited availability of annotated data often hinders real-world applications of machine
learning. To efficiently learn from small quantities of multimodal data, we leverage the …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Pointing novel objects in image captioning

Y Li, T Yao, Y Pan, H Chao… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Image captioning has received significant attention, with remarkable improvements in recent
years. Nevertheless, images in the wild encapsulate rich knowledge and cannot be …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …
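A minimal Python sketch of the general grounding idea this entry describes: the decoder emits a template with region slots, and each slot is filled with the label of a detected object. The template format, detections, and slot syntax are illustrative assumptions, not the paper's model.

```python
# Sketch of slot-filling with detector outputs (illustrative assumptions only).
import re

# Hypothetical detector output: region id -> category label.
detections = {3: "cake", 17: "puppy", 21: "table"}

# Hypothetical template produced by a captioning decoder, with <region-N> slots.
template = "A <region-17> is sitting at a <region-21> with a <region-3>."

def fill_slots(template: str, detections: dict[int, str]) -> str:
    # Replace each <region-N> slot with the detected label for region N.
    def replace(match: re.Match) -> str:
        region_id = int(match.group(1))
        return detections.get(region_id, "object")
    return re.sub(r"<region-(\d+)>", replace, template)

print(fill_slots(template, detections))
# -> "A puppy is sitting at a table with a cake."
```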

Learning to collocate neural modules for image captioning

X Yang, H Zhang, J Cai - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
We do not speak word by word from scratch; our brain quickly structures a pattern like "something
does something at someplace" and then fills in the detailed description. To render existing encoder …

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

ClipCap: CLIP prefix for image captioning

R Mokady, A Hertz, AH Bermano - arXiv preprint arXiv:2111.09734, 2021 - arxiv.org
Image captioning is a fundamental task in vision-language understanding, in which the model
predicts an informative textual caption for a given input image. In this paper, we present a …
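A minimal Python sketch of the general "prefix" idea named in the title: a CLIP image embedding is mapped to a short sequence of prefix embeddings that condition a language model. The dimensions, MLP mapper, and random stand-in tensors are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of mapping an image embedding to a language-model prefix (illustrative assumptions only).
import torch
import torch.nn as nn

CLIP_DIM = 512    # assumed size of the CLIP image embedding
LM_DIM = 768      # assumed hidden size of the language model (e.g. GPT-2 small)
PREFIX_LEN = 10   # assumed number of prefix tokens

class PrefixMapper(nn.Module):
    """Maps one image embedding to PREFIX_LEN pseudo-token embeddings."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(CLIP_DIM, LM_DIM * PREFIX_LEN // 2),
            nn.Tanh(),
            nn.Linear(LM_DIM * PREFIX_LEN // 2, LM_DIM * PREFIX_LEN),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, CLIP_DIM) -> (batch, PREFIX_LEN, LM_DIM)
        return self.mlp(clip_embedding).view(-1, PREFIX_LEN, LM_DIM)

# Stand-in for a real CLIP image embedding (hypothetical data).
image_embedding = torch.randn(1, CLIP_DIM)
prefix = PrefixMapper()(image_embedding)

# The prefix is concatenated with the caption's token embeddings and fed to a
# frozen or lightly tuned language model during training and decoding.
caption_token_embeddings = torch.randn(1, 12, LM_DIM)  # hypothetical
lm_input = torch.cat([prefix, caption_token_embeddings], dim=1)
print(lm_input.shape)  # torch.Size([1, 22, 768])
```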