Connecting vision and language with localized narratives

J Pont-Tuset, J Uijlings, S Changpinyo… - Computer Vision–ECCV …, 2020 - Springer
We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …

Learning visual representations with caption annotations

MB Sariyildiz, J Perez, D Larlus - … Conference, Glasgow, UK, August 23–28 …, 2020 - Springer
Pretraining general-purpose visual features has become a crucial part of tackling many
computer vision tasks. While one can learn such features on the extensively-annotated …

Towards unsupervised image captioning with shared multimodal embeddings

I Laina, C Rupprecht, N Navab - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Understanding images without explicit supervision has become an important problem in
computer vision. In this paper, we address image captioning by generating language …

Visual cluster grounding for image captioning

W Jiang, M Zhu, Y Fang, G Shi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Attention mechanisms have been extensively adopted in vision and language tasks such as
image captioning. They encourage a captioning model to dynamically ground appropriate …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Adding Chinese captions to images

X Li, W Lan, J Dong, H Liu - Proceedings of the 2016 ACM on …, 2016 - dl.acm.org
This paper extends research on automated image captioning in the dimension of language,
studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …

Know more say less: Image captioning based on scene graphs

X Li, S Jiang - IEEE Transactions on Multimedia, 2019 - ieeexplore.ieee.org
Automatically describing the content of an image has been attracting considerable research
attention in the multimedia field. To represent the content of an image, many approaches …

Visual clues: Bridging vision and language foundations for image paragraph captioning

Y Xie, L Zhou, X Dai, L Yuan, N Bach… - Advances in Neural …, 2022 - proceedings.neurips.cc
People say," A picture is worth a thousand words". Then how can we get the rich information
out of the image? We argue that by using visual clues to bridge large pretrained vision …

Show, control and tell: A framework for generating controllable and grounded captions

M Cornia, L Baraldi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current captioning approaches can describe images using black-box architectures whose
behavior is hard to control or explain from the outside. As an image can be …