A survey on knowledge-enhanced multimodal learning

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org

Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Cited by 44 · Related articles · All 2 versions

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

Cited by 2 · Related articles · All 2 versions

An overview of temporal commonsense reasoning and acquisition

G Wenzel, A Jatowt - arXiv preprint arXiv:2308.00002, 2023 - arxiv.org

Temporal commonsense reasoning refers to the ability to understand the typical temporal
context of phrases, actions, and events, and use it to reason over problems requiring such …

Cited by 8 · Related articles · All 2 versions

Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

A Kritharoula, M Lymperaiou, G Stamou - arXiv preprint arXiv:2310.14025, 2023 - arxiv.org

Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of
retrieving an image among a set of candidates, which better represents the meaning of an …

Cited by 9 · Related articles · All 5 versions

A Survey on Image-text Multimodal Models

R Guo, J Wei, L Sun, B Yu, G Chang, D Liu… - arXiv preprint arXiv …, 2023 - arxiv.org

Amidst the evolving landscape of artificial intelligence, the convergence of visual and textual
information has surfaced as a crucial frontier, leading to the advent of image-text multimodal …

Cited by 5 · Related articles · All 2 versions

Capturing the Concept Projection in Metaphorical Memes for Downstream Learning Tasks

S Acharya, B Das, TSB Sudarshan - IEEE Access, 2023 - ieeexplore.ieee.org

Metaphorical memes, where a source concept is projected into a target concept, are an
essential construct in figurative language. In this article, we present a novel approach for …

Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

R Garg, T Padhi, H Jain, U Kursuncu… - arXiv preprint arXiv …, 2024 - arxiv.org

Toxicity identification in online multimodal environments remains a challenging task due to
the complexity of contextual connections across modalities (e.g., textual and visual). In this …

Knowledge-based counterfactual queries for visual question answering

T Stoikou, M Lymperaiou, G Stamou - arXiv preprint arXiv:2303.02601, 2023 - arxiv.org

Visual Question Answering (VQA) has been a popular task that combines vision and
language, with numerous relevant implementations in literature. Even though there are …

Cited by 2 · Related articles · All 2 versions

Structured Intention Generation with Multimodal Graph Transformers: The MMIntent-LLM Framework

B Song, X Fan, Q Jia, R Xin… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org

In the task of answering questions related to electricity knowledge, accurately understanding
user intentions is fundamental to building an effective reasoning process. Questions in this …

Automatic generation of fashion images using prompting in generative machine learning models

Γ Αργυρού - 2024 - dspace.lib.ntua.gr

Abstract: In the contemporary fashion landscape, the convergence of technology and creativity has
created new opportunities and redirected the industry's standards. In the …
