Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey

SNAM Robi, MAZM Ariffin, MAM Izhar, N Ahmad… - IEEE …, 2024 - ieeexplore.ieee.org
The rapid progress of multimodal signal processing in recent years has cleared the way for
novel applications in human-computer interaction, surveillance, and telecommunication …

S-VVAD: Visual voice activity detection by motion segmentation

M Shahid, C Beyan, V Murino - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the challenging Voice Activity Detection (VAD) problem, which determines "Who
is Speaking and When?" in audiovisual recordings. The typical audio-based VAD systems …

Unicon: Unified context network for robust active speaker detection

Y Zhang, S Liang, S Yang, X Liu, Z Wu, S Shan… - Proceedings of the 29th …, 2021 - dl.acm.org
We propose a new efficient framework, the Unified Context Network (UniCon), for robust
active speaker detection (ASD). Traditional methods for ASD usually operate on each …

REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the Wild

JV Quiros, C Raman, S Tan, E Gedik… - arXiv preprint arXiv …, 2024 - arxiv.org
Recognizing speaking in humans is a central task towards understanding social
interactions. Ideally, speaking would be detected from individual voice recordings, as done …

Backchannel Detection and Agreement Estimation from Video with Transformer Networks

A Amer, C Bhuvaneshwara, GK Addluri… - … Joint Conference on …, 2023 - ieeexplore.ieee.org
Listeners use short interjections, so-called backchannels, to signify attention or express
agreement. The automatic analysis of this behavior is of key importance for human …

Conan: A usable tool for multimodal conversation analysis

A Penzkofer, P Müller, F Bühler, S Mayer… - Proceedings of the 2021 …, 2021 - dl.acm.org
Multimodal analysis of group behavior is a key task in human-computer interaction, and in
the social and behavioral sciences, but is often limited to more easily controllable laboratory …

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

C Yang, M Chen, Y Wang, Y Wang - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual speaker diarization refers to the task of identifying "who spoke when" by using
both audio and video data. Although previous fusion-based approaches have shown …

No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

J Vargas-Quiros, L Cabrera-Quiros, H Hung - arXiv preprint arXiv …, 2022 - arxiv.org
Recognizing who is speaking in a crowded scene is a key challenge towards the
understanding of the social interactions going on within. Detecting speaking status from …

ConfLab: a data collection concept, dataset, and benchmark for machine analysis of free-standing social interactions in the wild

C Raman, J Vargas Quiros, S Tan… - Advances in …, 2022 - proceedings.neurips.cc
Recording the dynamics of unscripted human interactions in the wild is challenging due to
the delicate trade-offs between several factors: participant privacy, ecological validity, data …