Multimodal LLM enhanced cross-lingual cross-modal retrieval

Y Wang, L Wang, Q Zhou, Z Wang, H Li, G Hua… - Proceedings of the …, 2024 - dl.acm.org
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based
on non-English queries, without relying on human-labeled cross-modal data pairs during …

Prompt Engineering for Large Models in Generative AI: Methods, Current Status, and Prospects

黄峻, 林飞, 杨静, 王兴霞, 倪清桦… - 智能科学与技术 …, 2024 - infocomm-journal.com
Large language models and vision-language models have shown great potential in applications across many fields and have become a research hotspot. However, issues such as hallucination, knowledge transfer, and alignment with human intent still limit the performance of large models. First, prompt engineering and alignment techniques are discussed …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Y Qi, H Li, Y Song, X Wu, J Luo - arXiv preprint arXiv:2412.08158, 2024 - arxiv.org
The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …