Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

CAD-contextual multi-modal alignment for dynamic AVQA

A Nadeem, A Hilton, R Dawes… - Proceedings of the …, 2024 - openaccess.thecvf.com
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual
modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing …

Modality-independent teachers meet weakly-supervised audio-visual event parser

YH Lai, YC Chen, F Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Audio-visual learning has been a major pillar of multi-modal machine learning, where the
community mostly focused on its *modality-aligned* setting, i.e., the audio …

Question-aware global-local video understanding network for audio-visual question answering

Z Chen, L Wang, P Wang, P Gao - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
As a newly emerging task, audio-visual question answering (AVQA) has attracted research
attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses …

Dialogmcf: Multimodal context flow for audio visual scene-aware dialog

Z Chen, H Liu, Y Wang - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
In recent years, Audio Visual Scene-Aware Dialog (AVSD) has been an active research task
in the multimodal dialogue community and has also been a core part of the Dialog System …

Multi-modal Video Dialog State Tracking in the Wild

A Abdessaied, L Shi, A Bulling - European Conference on Computer …, 2025 - Springer
We present MST-MIXER, a novel video dialog model
operating over a generic multi-modal state tracking scheme. Current models that claim to …

Investigation on transformer-based multi-modal fusion for audio-visual scene-aware dialog

X Huang, HL Tan, MC Leong, Y Sun, L Li… - Proc. DSTC10 …, 2022 - oar.a-star.edu.sg
In this report, we present our submissions to the DSTC10 Audio Visual Scene Dialog
(AVSD) challenge. We investigated variants of an encoder-decoder model, including those …

Audio visual scene-aware dialog generation with transformer-based video representations

Y Yamazaki, S Orihashi, R Masumura… - arXiv preprint arXiv …, 2022 - arxiv.org
There have been many attempts to build multimodal dialog systems that can respond to a
question about given audio-visual information, and the representative task for such systems …

Grounding is All You Need? Dual Temporal Grounding for Video Dialog

Y Qin, W Ji, X Lan, H Fei, X Yang, D Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
In the realm of video dialog response generation, the understanding of video content and
the temporal nuances of conversation history are paramount. While a segment of current …

MSG-BART: Multi-Granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-Grounded Dialogue Generation

H Liu, Z Chen, H Li, P Wang, Y Wang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Generating dialogue grounded in videos requires a high level of understanding and
reasoning about the visual scenes in the videos. However, existing large visual-language …