LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Enabling harmonious human-machine interaction with visual-context augmented dialogue system: A review

H Wang, B Guo, Y Zeng, Y Ding, C Qiu, Y Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
The intelligent dialogue system, aiming at communicating with humans harmoniously with
natural language, is brilliant for promoting the advancement of human-machine interaction …

HEAR: Hearing enhanced audio response for video-grounded dialogue

S Yoon, D Kim, E Yoon, HS Yoon, J Kim… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal
input comprising video, audio, and dialogue history. Although there have been numerous …

Multimodal dialogue state tracking

H Le, NF Chen, SCH Hoi - arXiv preprint arXiv:2206.07898, 2022 - arxiv.org
Designed for tracking user goals in dialogues, a dialogue state tracker is an essential
component in a dialogue system. However, the research of dialogue state tracking has …

Multi-modal Video Dialog State Tracking in the Wild

A Abdessaied, L Shi, A Bulling - European Conference on Computer …, 2025 - Springer
We present \(\mathbb{MST}_{\mathbb{MIXER}}\), a novel video dialog model
operating over a generic multi-modal state tracking scheme. Current models that claim to …

Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation

X Zhao, Y Wang, C Tao, C Wang, D Zhao - arXiv preprint arXiv …, 2022 - arxiv.org
We study video-grounded dialogue generation, where a response is generated based on
the dialogue context and the associated video. The primary challenges of this task lie in (1) …

Video dialog as conversation about objects living in space-time

HA Pham, TM Le, V Le, TM Phuong, T Tran - European Conference on …, 2022 - Springer
It would be a technological feat to be able to create a system that can hold a meaningful
conversation with humans about what they watch. A setup toward that goal is presented as a …

Uncovering Hidden Connections: Iterative Tracking and Reasoning for Video-grounded Dialog

H Zhang, M Liu, Y Wang, D Cao, W Guan… - arXiv preprint arXiv …, 2023 - arxiv.org
In contrast to conventional visual question answering, video-grounded dialog necessitates a
profound understanding of both dialog history and video content for accurate response …

Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

H Wang, B Guo, M Chen, Q Zhang, Y Ding… - Frontiers of Computer …, 2025 - Springer
Video-Grounded Dialogue System (VGDS), focusing on generating reasonable
responses based on multi-turn dialogue contexts and a given video, has received intensive …

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

A Abdessaied, M von Hochmeister, A Bulling - arXiv preprint arXiv …, 2024 - arxiv.org
We present the Object Language Video Transformer (OLViT), a novel model for video dialog
operating over a multi-modal attention-based dialog state tracker. Existing video dialog …