From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, ie describing images …

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Rstnet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …

Experience grounds language

Y Bisk, A Holtzman, J Thomason, J Andreas… - arXiv preprint arXiv …, 2020 - arxiv.org
Language understanding research is held back by a failure to relate language to the
physical world it describes and to the social interactions it facilitates. Despite the incredible …

Auto-encoding scene graphs for image captioning

X Yang, K Tang, H Zhang, J Cai - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Abstract We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more human-like …

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal-we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Multimodal transformer with multi-view visual representation for image captioning

J Yu, J Li, Z Yu, Q Huang - … on circuits and systems for video …, 2019 - ieeexplore.ieee.org
Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …

Knowing when to look: Adaptive attention via a visual sentinel for image captioning

J Lu, C Xiong, D Parikh… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
Attention-based neural encoder-decoder frameworks have been widely adopted for image
captioning. Most methods force visual attention to be active for every generated word …

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation

A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org
This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …