BiST: Bi-directional spatio-temporal reasoning for video-grounded dialogues

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer

Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

Cited by 293 · Related articles · All 15 versions

[PDF] arxiv.org

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

Cited by 55 · Related articles · All 5 versions

[PDF] arxiv.org

Meta-gui: Towards multi-modal conversational agents on mobile gui

L Sun, X Chen, L Chen, T Dai, Z Zhu, K Yu - arXiv preprint arXiv …, 2022 - arxiv.org

Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent
assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current …

Cited by 52 · Related articles · All 3 versions

[PDF] aclanthology.org

Vgnmn: Video-grounded neural module networks for video-grounded dialogue systems

H Le, N Chen, S Hoi - Proceedings of the 2022 Conference of the …, 2022 - aclanthology.org

Neural module networks (NMN) have achieved success in image-grounded tasks such as
Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN …

Cited by 23 · Related articles · All 2 versions

[PDF] arxiv.org

Enabling harmonious human-machine interaction with visual-context augmented dialogue system: A review

H Wang, B Guo, Y Zeng, Y Ding, C Qiu, Y Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org

The intelligent dialogue system, aiming at communicating with humans harmoniously with
natural language, is brilliant for promoting the advancement of human-machine interaction …

Cited by 3 · Related articles · All 2 versions

[PDF] acm.org

Learning to retrieve videos by asking questions

A Madasu, J Oliva, G Bertasius - Proceedings of the 30th ACM …, 2022 - dl.acm.org

The majority of traditional text-to-video retrieval systems operate in static environments, i.e.,
there is no interaction between the user and the agent beyond the initial textual query …

Cited by 17 · Related articles · All 4 versions

[PDF] arxiv.org

Geometric features informed multi-person human-object interaction recognition in videos

T Qiao, Q Men, FWB Li, Y Kubotani… - … on Computer Vision, 2022 - Springer

Human-Object Interaction (HOI) recognition in videos is important for analyzing
human activity. Most existing work focusing on visual features usually suffer from occlusion …

Cited by 11 · Related articles · All 11 versions

[PDF] arxiv.org

DVD: A diagnostic dataset for multi-step reasoning in video grounded dialogue

H Le, C Sankar, S Moon, A Beirami… - arXiv preprint arXiv …, 2021 - arxiv.org

A video-grounded dialogue system is required to understand both dialogue, which contains
semantic dependencies from turn to turn, and video, which contains visual cues of spatial …

Cited by 18 · Related articles · All 7 versions

[PDF] mit.edu

Maintaining common ground in dynamic environments

T Udagawa, A Aizawa - Transactions of the Association for …, 2021 - direct.mit.edu

Common grounding is the process of creating and maintaining mutual understandings,
which is a critical aspect of sophisticated human communication. While various task settings …

Cited by 16 · Related articles · All 9 versions

[PDF] arxiv.org

Multimodal dialogue state tracking

H Le, NF Chen, SCH Hoi - arXiv preprint arXiv:2206.07898, 2022 - arxiv.org

Designed for tracking user goals in dialogues, a dialogue state tracker is an essential
component in a dialogue system. However, the research of dialogue state tracking has …

Cited by 13 · Related articles · All 5 versions
