From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

All you may need for VQA are image captions

S Changpinyo, D Kukliansky, I Szpektor… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …

Human-like controllable image captioning with verb-specific semantic roles

L Chen, Z Jiang, J Xiao, W Liu - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Controllable Image Captioning (CIC), generating image descriptions following
designated control signals, has received unprecedented attention over the last few years …

Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning

E Erdem, M Kuyu, S Yagcioglu, A Frank… - Journal of Artificial …, 2022 - jair.org
Developing artificial learning systems that can understand and generate natural language
has been one of the long-standing goals of artificial intelligence. Recent decades have …

Underspecification in scene description-to-depiction tasks

B Hutchinson, J Baldridge, V Prabhakaran - arXiv preprint arXiv …, 2022 - arxiv.org
Questions regarding implicitness, ambiguity and underspecification are crucial for
understanding the task validity and ethical concerns of multimodal image+text systems, yet …

MemeCap: A dataset for captioning and interpreting memes

EJ Hwang, V Shwartz - arXiv preprint arXiv:2305.13703, 2023 - arxiv.org
Memes are a widely popular tool for web users to express their thoughts using visual
metaphors. Understanding memes requires recognizing and interpreting visual metaphors …

Crossmodal-3600: A massively multilingual multimodal evaluation dataset

AV Thapliyal, J Pont-Tuset, X Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
Research in massively multilingual image captioning has been severely hampered by a lack
of high-quality evaluation datasets. In this paper we present the Crossmodal-3600 dataset …

Visual language navigation: A survey and open challenges

SM Park, YG Kim - Artificial Intelligence Review, 2023 - Springer
With the recent development of deep learning, AI models are widely used in various
domains. AI models show good performance for definite tasks such as image classification …

Preserving semantic neighborhoods for robust cross-modal retrieval

C Thomas, A Kovashka - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
The abundance of multimodal data (e.g., social media posts) has inspired interest in cross-
modal retrieval methods. Popular approaches rely on a variety of metric learning losses …