- Academic resource search

Hierarchical semantic correspondence networks for video paragraph grounding

C Tan, Z Lin, JF Hu, WS Zheng… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-language
understanding, which aims to jointly localize multiple events from an untrimmed …

Cited by 22 · Related articles · All 6 versions

[PDF] arxiv.org

Arondight: Red teaming large vision language models with auto-generated multi-modal jailbreak prompts

Y Liu, C Cai, X Zhang, X Yuan, C Wang - Proceedings of the 32nd ACM …, 2024 - dl.acm.org

Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of
Large Language Models (LLMs). Despite offering new possibilities for LLM applications …

Cited by 6 · Related articles · All 5 versions

[PDF] aclanthology.org

Rethinking Multimodal Entity and Relation Extraction from a Translation Point of View

C Zheng, J Feng, Y Cai, X Wei, Q Li - Proceedings of the 61st …, 2023 - aclanthology.org

We revisit the multimodal entity and relation extraction from a translation point of view.
Special attention is paid on the misalignment issue in text-image datasets which may …

Cited by 16 · Related articles · All 3 versions

[PDF] arxiv.org

Vaquita: Enhancing alignment in LLM-assisted video understanding

Y Wang, R Zhang, H Wang, U Bhattacharya… - arXiv preprint arXiv …, 2023 - arxiv.org

Recent advancements in language-model-based video understanding have been
progressing at a remarkable pace, spurred by the introduction of Large Language Models …

Cited by 7 · Related articles · All 2 versions

[PDF] arxiv.org

A multi-modal context reasoning approach for conditional inference on joint textual and visual clues

Y Li, B Hu, X Chen, Y Ding, L Ma, M Zhang - arXiv preprint arXiv …, 2023 - arxiv.org

Conditional inference on joint textual and visual clues is a multi-modal reasoning task that
textual clues provide prior permutation or external knowledge, which are complementary …

Cited by 14 · Related articles · All 5 versions

[PDF] arxiv.org

Weakly-supervised learning of visual relations in multimodal pretraining

E Bugliarello, A Nematzadeh, LA Hendricks - arXiv preprint arXiv …, 2023 - arxiv.org

Recent work in vision-and-language pretraining has investigated supervised signals from
object detection data to learn better, fine-grained multimodal representations. In this work …

Cited by 6 · Related articles · All 4 versions

[PDF] arxiv.org

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

M Du, B Wu, Z Li, X Huang, Z Wei - arXiv preprint arXiv:2406.05756, 2024 - arxiv.org

The recent rapid development of Large Vision-Language Models (LVLMs) has indicated
their potential for embodied tasks. However, the critical skill of spatial understanding in …

Cited by 7 · Related articles · All 2 versions

[PDF] arxiv.org

M2ConceptBase: A fine-grained aligned multi-modal conceptual knowledge base

Z Zha, J Wang, Z Li, X Zhu, W Song, Y Xiao - arXiv preprint arXiv …, 2023 - arxiv.org

Large multi-modal models (LMMs) have demonstrated promising intelligence owing to the
rapid development of pre-training techniques. However, their fine-grained cross-modal …

Cited by 6 · Related articles · All 2 versions

[PDF] arxiv.org

DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning

M Du, B Wu, J Zhang, Z Fan, Z Li, R Luo… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision-and-Language navigation (VLN) requires an agent to navigate in unseen
environment by following natural language instruction. For task completion, the agent needs …

Cited by 3 · Related articles · All 3 versions

[PDF] arxiv.org

ConcEPT: Concept-Enhanced Pre-Training for Language Models

X Wang, Z Gu, J Liang, D Lu, Y Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org

Pre-trained language models (PLMs) have been prevailing in state-of-the-art methods for
natural language processing, and knowledge-enhanced PLMs are further proposed to …

Cited by 2 · Related articles · All 2 versions
