Connecting vision and language with localized narratives

J Pont-Tuset, J Uijlings, S Changpinyo… - Computer Vision–ECCV …, 2020 - Springer
We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …

Learning visual representations with caption annotations

MB Sariyildiz, J Perez, D Larlus - … Conference, Glasgow, UK, August 23–28 …, 2020 - Springer
Pretraining general-purpose visual features has become a crucial part of tackling many
computer vision tasks. While one can learn such features on the extensively-annotated …

Towards unsupervised image captioning with shared multimodal embeddings

I Laina, C Rupprecht, N Navab - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Understanding images without explicit supervision has become an important problem in
computer vision. In this paper, we address image captioning by generating language …

Visual cluster grounding for image captioning

W Jiang, M Zhu, Y Fang, G Shi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Attention mechanisms have been extensively adopted in vision and language tasks such as
image captioning. They encourage a captioning model to dynamically ground appropriate …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Adding Chinese captions to images

X Li, W Lan, J Dong, H Liu - Proceedings of the 2016 ACM on …, 2016 - dl.acm.org
This paper extends research on automated image captioning in the dimension of language,
studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …

Know more say less: Image captioning based on scene graphs

X Li, S Jiang - IEEE Transactions on Multimedia, 2019 - ieeexplore.ieee.org
Automatically describing the content of an image has been attracting considerable research
attention in the multimedia field. To represent the content of an image, many approaches …

Visual clues: Bridging vision and language foundations for image paragraph captioning

Y Xie, L Zhou, X Dai, L Yuan, N Bach… - Advances in Neural …, 2022 - proceedings.neurips.cc
People say," A picture is worth a thousand words". Then how can we get the rich information
out of the image? We argue that by using visual clues to bridge large pretrained vision …

Show, control and tell: A framework for generating controllable and grounded captions

M Cornia, L Baraldi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current captioning approaches can describe images using black-box architectures whose
behavior is hard to control or explain from the outside. As an image can be …