Hierarchical semantic correspondence networks for video paragraph grounding

C Tan, Z Lin, JF Hu, WS Zheng… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-
language understanding, which aims to jointly localize multiple events from an untrimmed …

Arondight: Red teaming large vision language models with auto-generated multi-modal jailbreak prompts

Y Liu, C Cai, X Zhang, X Yuan, C Wang - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of
Large Language Models (LLMs). Despite offering new possibilities for LLM applications …

Rethinking Multimodal Entity and Relation Extraction from a Translation Point of View

C Zheng, J Feng, Y Cai, X Wei, Q Li - Proceedings of the 61st …, 2023 - aclanthology.org
We revisit the multimodal entity and relation extraction from a translation point of view.
Special attention is paid to the misalignment issue in text-image datasets which may …

Vaquita: Enhancing alignment in LLM-assisted video understanding

Y Wang, R Zhang, H Wang, U Bhattacharya… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advancements in language-model-based video understanding have been
progressing at a remarkable pace, spurred by the introduction of Large Language Models …

A multi-modal context reasoning approach for conditional inference on joint textual and visual clues

Y Li, B Hu, X Chen, Y Ding, L Ma, M Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
Conditional inference on joint textual and visual clues is a multi-modal reasoning task in which
textual clues provide prior permutation or external knowledge, which are complementary …

Weakly-supervised learning of visual relations in multimodal pretraining

E Bugliarello, A Nematzadeh, LA Hendricks - arXiv preprint arXiv …, 2023 - arxiv.org
Recent work in vision-and-language pretraining has investigated supervised signals from
object detection data to learn better, fine-grained multimodal representations. In this work …

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

M Du, B Wu, Z Li, X Huang, Z Wei - arXiv preprint arXiv:2406.05756, 2024 - arxiv.org
The recent rapid development of Large Vision-Language Models (LVLMs) has indicated
their potential for embodied tasks. However, the critical skill of spatial understanding in …

M2ConceptBase: A fine-grained aligned multi-modal conceptual knowledge base

Z Zha, J Wang, Z Li, X Zhu, W Song, Y Xiao - arXiv preprint arXiv …, 2023 - arxiv.org
Large multi-modal models (LMMs) have demonstrated promising intelligence owing to the
rapid development of pre-training techniques. However, their fine-grained cross-modal …

DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning

M Du, B Wu, J Zhang, Z Fan, Z Li, R Luo… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-and-Language Navigation (VLN) requires an agent to navigate in unseen
environments by following natural language instructions. For task completion, the agent needs …

ConcEPT: Concept-Enhanced Pre-Training for Language Models

X Wang, Z Gu, J Liang, D Lu, Y Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-trained language models (PLMs) are prevalent in state-of-the-art methods for
natural language processing, and knowledge-enhanced PLMs are further proposed to …