Recent advances in deep learning based dialogue systems: A systematic survey

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer
Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

Meta-GUI: Towards multi-modal conversational agents on mobile GUI

L Sun, X Chen, L Chen, T Dai, Z Zhu, K Yu - arXiv preprint arXiv …, 2022 - arxiv.org
Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent
assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current …

VGNMN: Video-grounded neural module networks for video-grounded dialogue systems

H Le, N Chen, S Hoi - Proceedings of the 2022 Conference of the …, 2022 - aclanthology.org
Neural module networks (NMN) have achieved success in image-grounded tasks such as
Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN …

Enabling harmonious human-machine interaction with visual-context augmented dialogue system: A review

H Wang, B Guo, Y Zeng, Y Ding, C Qiu, Y Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
The intelligent dialogue system, aiming at communicating with humans harmoniously with
natural language, is brilliant for promoting the advancement of human-machine interaction …

Learning to retrieve videos by asking questions

A Madasu, J Oliva, G Bertasius - Proceedings of the 30th ACM …, 2022 - dl.acm.org
The majority of traditional text-to-video retrieval systems operate in static environments, i.e.,
there is no interaction between the user and the agent beyond the initial textual query …

Geometric features informed multi-person human-object interaction recognition in videos

T Qiao, Q Men, FWB Li, Y Kubotani… - … on Computer Vision, 2022 - Springer
Human-Object Interaction (HOI) recognition in videos is important for analyzing
human activity. Most existing work focusing on visual features usually suffers from occlusion …

DVD: A diagnostic dataset for multi-step reasoning in video grounded dialogue

H Le, C Sankar, S Moon, A Beirami… - arXiv preprint arXiv …, 2021 - arxiv.org
A video-grounded dialogue system is required to understand both dialogue, which contains
semantic dependencies from turn to turn, and video, which contains visual cues of spatial …

Maintaining common ground in dynamic environments

T Udagawa, A Aizawa - Transactions of the Association for …, 2021 - direct.mit.edu
Common grounding is the process of creating and maintaining mutual understandings,
which is a critical aspect of sophisticated human communication. While various task settings …

Multimodal dialogue state tracking

H Le, NF Chen, SCH Hoi - arXiv preprint arXiv:2206.07898, 2022 - arxiv.org
Designed for tracking user goals in dialogues, a dialogue state tracker is an essential
component in a dialogue system. However, the research of dialogue state tracking has …