Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - Proceedings of the …, 2021 - openaccess.thecvf.com
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

Connecting vision and language with localized narratives

J Pont-Tuset, J Uijlings, S Changpinyo… - Computer Vision–ECCV …, 2020 - Springer
We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …

Look before you speak: Visually contextualized utterances

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
While most conversational AI systems focus on textual dialogue only, conditioning
utterances on visual context (when it's available) can lead to more realistic conversations …

Re-attention for visual question answering

W Guo, Y Zhang, J Yang, X Yuan - IEEE Transactions on Image …, 2021 - ieeexplore.ieee.org
A simultaneous understanding of questions and images is crucial in Visual Question
Answering (VQA). While the existing models have achieved satisfactory performance by …

SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text

PN Chowdhury, AK Bhunia, A Sain… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we extend scene understanding to include that of human sketch. The result is a
complete trilogy of scene representation from three diverse and complementary modalities …

Answer-me: Multi-task open-vocabulary visual question answering

AJ Piergiovanni, W Li, W Kuo, M Saffar… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Answer-Me, a task-aware multi-task framework which unifies a variety of
question answering tasks, such as visual question answering, visual entailment, visual …

Object-centric unsupervised image captioning

Z Meng, D Yang, X Cao, A Shah, SN Lim - European Conference on …, 2022 - Springer
Image captioning is a longstanding problem in the field of computer vision and natural
language processing. To date, researchers have produced impressive state-of-the-art …

Reinforcing an image caption generator using off-line human feedback

PH Seo, P Sharma, T Levinboim, B Han… - Proceedings of the AAAI …, 2020 - aaai.org
Human ratings are currently the most accurate way to assess the quality of an image
captioning model, yet most often the only outcome used from an expensive human rating …

Quality estimation for image captions based on large-scale human evaluations

T Levinboim, AV Thapliyal, P Sharma… - arXiv preprint arXiv …, 2019 - arxiv.org
Automatic image captioning has improved significantly over the last few years, but the
problem is far from solved, with state-of-the-art models still often producing low-quality …