From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

ClipCap: CLIP prefix for image captioning

R Mokady, A Hertz, AH Bermano - arXiv preprint arXiv:2111.09734, 2021 - arxiv.org
Image captioning is a fundamental task in vision-language understanding, where the model
predicts a textual informative caption to a given input image. In this paper, we present a …

TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts

C Guo, X Zuo, S Wang, L Cheng - European Conference on Computer …, 2022 - Springer
Inspired by the strong ties between vision and language, the two intimate human sensing
and communication modalities, our paper aims to explore the generation of 3D human full …

Multimodal transformer with multi-view visual representation for image captioning

J Yu, J Li, Z Yu, Q Huang - … on circuits and systems for video …, 2019 - ieeexplore.ieee.org
Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …

Region-aware image captioning via interaction learning

AA Liu, Y Zhai, N Xu, W Nie, W Li… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Image captioning is one of the primary goals in computer vision which aims to automatically
generate natural descriptions for images. Intuitively, human visual system can notice some …

More photos are all you need: Semi-supervised learning for fine-grained sketch based image retrieval

AK Bhunia, PN Chowdhury, A Sain… - Proceedings of the …, 2021 - openaccess.thecvf.com
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval
(FG-SBIR) models is the data scarcity: model performances are largely bottlenecked by the lack …

Fashion captioning: Towards generating accurate descriptions with semantic rewards

X Yang, H Zhang, D Jin, Y Liu, CH Wu, J Tan… - Computer Vision–ECCV …, 2020 - Springer
Generating accurate descriptions for online fashion items is important not only for enhancing
customers' shopping experiences, but also for the increase of online sales. Besides the …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Fine-grained image captioning with global-local discriminative objective

J Wu, T Chen, H Wu, Z Yang, G Luo… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Significant progress has been made in recent years in image captioning, an active topic in
the fields of vision and language. However, existing methods tend to yield overly general …