Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - Proceedings of the …, 2021 - openaccess.thecvf.com
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

Connecting vision and language with localized narratives

J Pont-Tuset, J Uijlings, S Changpinyo… - Computer Vision–ECCV …, 2020 - Springer
We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …

Look before you speak: Visually contextualized utterances

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
While most conversational AI systems focus on textual dialogue only, conditioning
utterances on visual context (when it's available) can lead to more realistic conversations …

Re-attention for visual question answering

W Guo, Y Zhang, J Yang, X Yuan - IEEE Transactions on Image …, 2021 - ieeexplore.ieee.org
A simultaneous understanding of questions and images is crucial in Visual Question
Answering (VQA). While the existing models have achieved satisfactory performance by …

SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text

PN Chowdhury, AK Bhunia, A Sain… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we extend scene understanding to include that of human sketch. The result is a
complete trilogy of scene representation from three diverse and complementary modalities …

Answer-me: Multi-task open-vocabulary visual question answering

AJ Piergiovanni, W Li, W Kuo, M Saffar… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Answer-Me, a task-aware multi-task framework which unifies a variety of
question answering tasks, such as visual question answering, visual entailment, visual …

Object-centric unsupervised image captioning

Z Meng, D Yang, X Cao, A Shah, SN Lim - European Conference on …, 2022 - Springer
Image captioning is a longstanding problem in the field of computer vision and natural
language processing. To date, researchers have produced impressive state-of-the-art …

Reinforcing an image caption generator using off-line human feedback

PH Seo, P Sharma, T Levinboim, B Han… - Proceedings of the AAAI …, 2020 - aaai.org
Human ratings are currently the most accurate way to assess the quality of an image
captioning model, yet most often the only outcome used from an expensive human rating …

Quality estimation for image captions based on large-scale human evaluations

T Levinboim, AV Thapliyal, P Sharma… - arXiv preprint arXiv …, 2019 - arxiv.org
Automatic image captioning has improved significantly over the last few years, but the
problem is far from solved, with state-of-the-art models still often producing low-quality …