From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Diffuseq: Sequence to sequence text generation with diffusion models

S Gong, M Li, J Feng, Z Wu, LP Kong - arXiv preprint arXiv:2210.08933, 2022 - arxiv.org
Recently, diffusion models have emerged as a new paradigm for generative models.
Despite the success in domains using continuous signals such as vision and audio …

Say as you wish: Fine-grained control of image caption generation with abstract scene graphs

S Chen, Q Jin, P Wang, Q Wu - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Humans are able to describe image contents with coarse to fine details as they wish.
However, most image captioning models are intention-agnostic which cannot generate …

Controllable video captioning with POS sequence guidance based on gated fusion network

B Wang, L Ma, W Zhang, W Jiang… - Proceedings of the …, 2019 - openaccess.thecvf.com
In this paper, we propose to guide the video caption generation with Part-of-Speech (POS)
information, based on a gated fusion of multiple representations of input videos. We …

Syntax-aware action targeting for video captioning

Q Zheng, C Wang, D Tao - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Existing methods on video captioning have made great efforts to identify objects/instances in
videos, but few of them emphasize the prediction of action. As a result, the learned models …

Layoutdiffusion: Improving graphic layout generation by discrete diffusion probabilistic models

J Zhang, J Guo, S Sun, JG Lou… - Proceedings of the …, 2023 - openaccess.thecvf.com
Creating graphic layouts is a fundamental step in graphic designs. In this work, we present a
novel generative model named LayoutDiffusion for automatic layout generation. As layout is …

Comprehensive image captioning via scene graph decomposition

Y Zhong, L Wang, J Chen, D Yu, Y Li - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
We address the challenging problem of image captioning by revisiting the representation of
image scene graph. At the core of our method lies the decomposition of a scene graph into a …

Show, control and tell: A framework for generating controllable and grounded captions

M Cornia, L Baraldi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current captioning approaches can describe images using black-box architectures whose
behavior is hardly controllable and explainable from the exterior. As an image can be …

Difformer: Empowering diffusion models on the embedding space for text generation

Z Gao, J Guo, X Tan, Y Zhu, F Zhang, J Bian… - arXiv preprint arXiv …, 2022 - arxiv.org
Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio
tasks, and recent works further adapt them to textual data by diffusing on the embedding …