From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench", a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed a resurgence of knowledge engineering, marked by the fast
growth of knowledge graphs. However, most existing knowledge graphs are …

HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi …

F Liu, T Guan, Z Li, L Chen, Y Yacoob… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …

Exploiting BERT for multimodal target sentiment classification through input space translation

Z Khan, Y Fu - Proceedings of the 29th ACM international conference …, 2021 - dl.acm.org
Multimodal target/aspect sentiment classification combines multimodal sentiment analysis
and aspect/target sentiment classification. The goal of the task is to combine vision and …

Ei-clip: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval

H Ma, H Zhao, Z Lin, A Kale, Z Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
… recommendation, and marketing services. Extensive efforts have been made to
conquer the cross-modal retrieval problem in the general domain. When it comes to E …

Visual news: Benchmark and challenges in news image captioning

F Liu, Y Wang, T Wang, V Ordonez - arXiv preprint arXiv:2010.03743, 2020 - arxiv.org
We propose Visual News Captioner, an entity-aware model for the task of news image
captioning. We also introduce Visual News, a large-scale benchmark consisting of more …

Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Let there be a clock on the beach: Reducing object hallucination in image captioning

AF Biten, L Gómez, D Karatzas - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Describing an image with missing or non-existent objects is known as object bias
(hallucination) in image captioning. This behaviour is quite common in the state-of-the-art …