Recurrent fusion network for image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Cited by 293 · Related articles · All 11 versions

[PDF] arxiv.org

A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org

Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Cited by 59 · Related articles · All 7 versions

[PDF] thecvf.com

Meshed-memory transformer for image captioning

M Cornia, M Stefanini, L Baraldi… - Proceedings of the …, 2020 - openaccess.thecvf.com

Transformer-based architectures represent the state of the art in sequence modeling tasks
like machine translation and language understanding. Their applicability to multi-modal …

Cited by 1015 · Related articles · All 13 versions

[PDF] thecvf.com

Rstnet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com

Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

Cited by 205 · Related articles · All 5 versions

[PDF] thecvf.com

X-linear attention networks for image captioning

Y Pan, T Yao, Y Li, T Mei - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com

Recent progress on fine-grained visual recognition and visual question answering has
featured Bilinear Pooling, which effectively models the 2nd order interactions across multi …

Cited by 604 · Related articles · All 8 versions

[PDF] thecvf.com

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …

Cited by 980 · Related articles · All 9 versions

Task-adaptive attention for image captioning

C Yan, Y Hao, L Li, J Yin, A Liu, Z Mao… - … on Circuits and …, 2021 - ieeexplore.ieee.org

Attention mechanisms are now widely used in image captioning models. However, most
attention models only focus on visual features. When generating syntax related words, little …

Cited by 222 · Related articles · All 2 versions

[PDF] thecvf.com

Semantic-conditional diffusion networks for image captioning

J Luo, Y Li, Y Pan, T Yao, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recent advances on text-to-image generation have witnessed the rise of diffusion models
which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent …

Cited by 46 · Related articles · All 5 versions

[PDF] thecvf.com

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com

Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

Cited by 92 · Related articles · All 9 versions

[PDF] arxiv.org

Multimodal transformer with multi-view visual representation for image captioning

J Yu, J Li, Z Yu, Q Huang - … on circuits and systems for video …, 2019 - ieeexplore.ieee.org

Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder-decoder framework. The …

Cited by 394 · Related articles · All 5 versions
