Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey

SNAM Robi, MAZM Ariffin, MAM Izhar, N Ahmad… - IEEE …, 2024 - ieeexplore.ieee.org
The rapid progress of multimodal signal processing in recent years has cleared the way for
novel applications in human-computer interaction, surveillance, and telecommunication …

S-VVAD: Visual voice activity detection by motion segmentation

M Shahid, C Beyan, V Murino - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the challenging Voice Activity Detection (VAD) problem, which determines "Who
is Speaking and When?" in audiovisual recordings. The typical audio-based VAD systems …

Unicon: Unified context network for robust active speaker detection

Y Zhang, S Liang, S Yang, X Liu, Z Wu, S Shan… - Proceedings of the 29th …, 2021 - dl.acm.org
We propose a new efficient framework, the Unified Context Network (UniCon), for robust
active speaker detection (ASD). Traditional methods for ASD usually operate on each …

REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the Wild

JV Quiros, C Raman, S Tan, E Gedik… - arXiv preprint arXiv …, 2024 - arxiv.org
Recognizing speaking in humans is a central task towards understanding social
interactions. Ideally, speaking would be detected from individual voice recordings, as done …

Backchannel Detection and Agreement Estimation from Video with Transformer Networks

A Amer, C Bhuvaneshwara, GK Addluri… - … Joint Conference on …, 2023 - ieeexplore.ieee.org
Listeners use short interjections, so-called backchannels, to signify attention or express
agreement. The automatic analysis of this behavior is of key importance for human …

Conan: A usable tool for multimodal conversation analysis

A Penzkofer, P Müller, F Bühler, S Mayer… - Proceedings of the 2021 …, 2021 - dl.acm.org
Multimodal analysis of group behavior is a key task in human-computer interaction, and in
the social and behavioral sciences, but is often limited to more easily controllable laboratory …

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

C Yang, M Chen, Y Wang, Y Wang - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual speaker diarization refers to the task of identifying "who spoke when" by using
both audio and video data. Although previous fusion-based approaches have shown …

No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

J Vargas-Quiros, L Cabrera-Quiros, H Hung - arXiv preprint arXiv …, 2022 - arxiv.org
Recognizing who is speaking in a crowded scene is a key challenge towards the
understanding of the social interactions going on within. Detecting speaking status from …

ConfLab: a data collection concept, dataset, and benchmark for machine analysis of free-standing social interactions in the wild

C Raman, J Vargas Quiros, S Tan… - Advances in …, 2022 - proceedings.neurips.cc
Recording the dynamics of unscripted human interactions in the wild is challenging due to
the delicate trade-offs between several factors: participant privacy, ecological validity, data …