From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

All you may need for VQA are image captions

S Changpinyo, D Kukliansky, I Szpektor… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …

Human-like controllable image captioning with verb-specific semantic roles

L Chen, Z Jiang, J Xiao, W Liu - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Controllable Image Captioning (CIC), generating image descriptions following
designated control signals, has received unprecedented attention over the last few years …

Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning

E Erdem, M Kuyu, S Yagcioglu, A Frank… - Journal of Artificial …, 2022 - jair.org
Developing artificial learning systems that can understand and generate natural language
has been one of the long-standing goals of artificial intelligence. Recent decades have …

Underspecification in scene description-to-depiction tasks

B Hutchinson, J Baldridge, V Prabhakaran - arXiv preprint arXiv …, 2022 - arxiv.org
Questions regarding implicitness, ambiguity and underspecification are crucial for
understanding the task validity and ethical concerns of multimodal image+text systems, yet …

MemeCap: A dataset for captioning and interpreting memes

EJ Hwang, V Shwartz - arXiv preprint arXiv:2305.13703, 2023 - arxiv.org
Memes are a widely popular tool for web users to express their thoughts using visual
metaphors. Understanding memes requires recognizing and interpreting visual metaphors …

Crossmodal-3600: A massively multilingual multimodal evaluation dataset

AV Thapliyal, J Pont-Tuset, X Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
Research in massively multilingual image captioning has been severely hampered by a lack
of high-quality evaluation datasets. In this paper we present the Crossmodal-3600 dataset …

Visual language navigation: A survey and open challenges

SM Park, YG Kim - Artificial Intelligence Review, 2023 - Springer
With the recent development of deep learning, AI models are widely used in various
domains. AI models show good performance for definite tasks such as image classification …

Preserving semantic neighborhoods for robust cross-modal retrieval

C Thomas, A Kovashka - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
The abundance of multimodal data (e.g., social media posts) has inspired interest in cross-
modal retrieval methods. Popular approaches rely on a variety of metric learning losses …