From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Meshed-memory transformer for image captioning

M Cornia, M Stefanini, L Baraldi… - Proceedings of the …, 2020 - openaccess.thecvf.com
Transformer-based architectures represent the state of the art in sequence modeling tasks
like machine translation and language understanding. Their applicability to multi-modal …

RSTNet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

X-linear attention networks for image captioning

Y Pan, T Yao, Y Li, T Mei - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Recent progress on fine-grained visual recognition and visual question answering has
featured Bilinear Pooling, which effectively models the 2nd order interactions across multi …

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …

Task-adaptive attention for image captioning

C Yan, Y Hao, L Li, J Yin, A Liu, Z Mao… - … on Circuits and …, 2021 - ieeexplore.ieee.org
Attention mechanisms are now widely used in image captioning models. However, most
attention models only focus on visual features. When generating syntax related words, little …

Semantic-conditional diffusion networks for image captioning

J Luo, Y Li, Y Pan, T Yao, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances on text-to-image generation have witnessed the rise of diffusion models
which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent …

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

Multimodal transformer with multi-view visual representation for image captioning

J Yu, J Li, Z Yu, Q Huang - … on circuits and systems for video …, 2019 - ieeexplore.ieee.org
Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …