Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

CAD-contextual multi-modal alignment for dynamic AVQA

A Nadeem, A Hilton, R Dawes… - Proceedings of the …, 2024 - openaccess.thecvf.com
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual
modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing …

Modality-independent teachers meet weakly-supervised audio-visual event parser

YH Lai, YC Chen, F Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Audio-visual learning has been a major pillar of multi-modal machine learning, where the
community mostly focused on its *modality-aligned* setting, i.e., the audio …

Question-aware global-local video understanding network for audio-visual question answering

Z Chen, L Wang, P Wang, P Gao - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
As a newly emerging task, audio-visual question answering (AVQA) has attracted research
attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses …

Dialogmcf: Multimodal context flow for audio visual scene-aware dialog

Z Chen, H Liu, Y Wang - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
In recent years, Audio Visual Scene-Aware Dialog (AVSD) has been an active research task
in the multimodal dialogue community and has also been a core part of the Dialog System …

Multi-modal Video Dialog State Tracking in the Wild

A Abdessaied, L Shi, A Bulling - European Conference on Computer …, 2025 - Springer
We present MST-MIXER, a novel video dialog model
operating over a generic multi-modal state tracking scheme. Current models that claim to …

Investigation on transformer-based multi-modal fusion for audio-visual scene-aware dialog

X Huang, HL Tan, MC Leong, Y Sun, L Li… - Proc. DSTC10 …, 2022 - oar.a-star.edu.sg
In this report, we present our submissions to the DSTC10 Audio Visual Scene Dialog
(AVSD) challenge. We investigated variants of an encoder-decoder model, including those …

Audio visual scene-aware dialog generation with transformer-based video representations

Y Yamazaki, S Orihashi, R Masumura… - arXiv preprint arXiv …, 2022 - arxiv.org
There have been many attempts to build multimodal dialog systems that can respond to a
question about given audio-visual information, and the representative task for such systems …

Grounding is All You Need? Dual Temporal Grounding for Video Dialog

Y Qin, W Ji, X Lan, H Fei, X Yang, D Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
In the realm of video dialog response generation, the understanding of video content and
the temporal nuances of conversation history are paramount. While a segment of current …

MSG-BART: Multi-Granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-Grounded Dialogue Generation

H Liu, Z Chen, H Li, P Wang, Y Wang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Generating dialogue grounded in videos requires a high level of understanding and
reasoning about the visual scenes in the videos. However, existing large visual-language …