Normalized and geometry-aware self-attention network for image captioning

C Zhang, C Zhang, S Zheng, Y Qiao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org

As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …

Cited by 206 - Related articles - All 4 versions

[PDF] arxiv.org

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e. describing images …

Cited by 394 - Related articles - All 11 versions

[PDF] springer.com

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer

The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …

Cited by 352 - Related articles - All 7 versions

[PDF] arxiv.org

Cross-attention of disentangled modalities for 3D human mesh recovery with transformers

J Cho, K Youwang, TH Oh - European Conference on Computer Vision, 2022 - Springer

Transformer encoder architectures have recently achieved state-of-the-art results on
monocular 3D human mesh reconstruction, but they require a substantial number of …

Cited by 135 - Related articles - All 4 versions

[PDF] aaai.org

Dual-level collaborative transformer for image captioning

Y Luo, J Ji, X Sun, L Cao, Y Wu, F Huang… - Proceedings of the …, 2021 - ojs.aaai.org

Descriptive region features extracted by object detection networks have played an important
role in the recent advancements of image captioning. However, they are still criticized for the …

Cited by 324 - Related articles - All 6 versions

[PDF] thecvf.com

RSTNet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com

Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

Cited by 259 - Related articles - All 5 versions

[PDF] thecvf.com

Comprehending and ordering semantics for image captioning

Y Li, Y Pan, T Yao, T Mei - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com

Comprehending the rich semantics in an image and ordering them in linguistic order are
essential to compose a visually-grounded and linguistically coherent description for image …

Cited by 117 - Related articles - All 5 versions

[PDF] thecvf.com

ZeroCap: Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Cited by 163 - Related articles - All 6 versions

[PDF] arxiv.org

GRIT: Faster and better image captioning transformer using dual visual features

VQ Nguyen, M Suganuma, T Okatani - European Conference on Computer …, 2022 - Springer

Current state-of-the-art methods for image captioning employ region-based features, as they
provide object-level information that is essential to describe the content of images; they are …

Cited by 130 - Related articles - All 8 versions

[PDF] thecvf.com

KiUT: Knowledge-injected U-transformer for radiology report generation

Z Huang, X Zhang, S Zhang - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Radiology report generation aims to automatically generate a clinically accurate and
coherent paragraph from the X-ray image, which could relieve radiologists from the heavy …

Cited by 81 - Related articles - All 7 versions
