Cross-modal supervision for learning active speaker detection in video

A Nagrani, JS Chung, W Xie, A Zisserman - Computer Speech & Language, 2020 - Elsevier

The objective of this work is speaker recognition under noisy and unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale audio-visual dataset …

Cited by 790 · Related articles · All 11 versions

[PDF] arxiv.org

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer

Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Cited by 290 · Related articles · All 8 versions

[PDF] acm.org

Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …

Cited by 188 · Related articles · All 5 versions

[PDF] arxiv.org

Voxceleb: a large-scale speaker identification dataset

A Nagrani, JS Chung, A Zisserman - arXiv preprint arXiv:1706.08612, 2017 - arxiv.org

Most existing datasets for speaker identification contain samples obtained under quite
constrained conditions, and are usually hand-annotated, hence limited in size. The goal of …

Cited by 2822 · Related articles · All 15 versions

[PDF] ucl.ac.uk

Human movement datasets: An interdisciplinary scoping review

T Olugbade, M Bieńkiewicz, G Barbareschi… - ACM Computing …, 2022 - dl.acm.org

Movement dataset reviews exist but are limited in coverage, both in terms of size and
research discipline. While topic-specific reviews clearly have their merit, it is critical to have a …

Cited by 24 · Related articles · All 8 versions

[PDF] ox.ac.uk

Out of time: automated lip sync in the wild

JS Chung, A Zisserman - … Vision–ACCV 2016 Workshops: ACCV 2016 …, 2017 - Springer

The goal of this work is to determine the audio-video synchronisation between mouth motion
and speech in a video. We propose a two-stream ConvNet architecture that enables the …

Cited by 797 · Related articles · All 8 versions

[PDF] arxiv.org

Ava active speaker: An audio-visual dataset for active speaker detection

J Roth, S Chaudhuri, O Klejch, R Marvin… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org

Active speaker detection is an important component in video analysis algorithms for
applications such as speaker diarization, video re-targeting for meetings, speech …

Cited by 190 · Related articles · All 6 versions

[PDF] thecvf.com

A light weight model for active speaker detection

J Liao, H Duan, K Feng, W Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com

Active speaker detection is a challenging task in audio-visual scenarios, with the aim to
detect who is speaking in one or more speaker scenarios. This task has received …

Cited by 36 · Related articles · All 8 versions

[PDF] ieee.org

Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey

SNAM Robi, MAZM Ariffin, MAM Izhar, N Ahmad… - IEEE …, 2024 - ieeexplore.ieee.org

The rapid progress of multimodal signal processing in recent years has cleared the way for
novel applications in human-computer interaction, surveillance, and telecommunication …

Cited by 2 · Related articles · All 2 versions

[PDF] arxiv.org

Learning long-term spatial-temporal graphs for active speaker detection

K Min, S Roy, S Tripathi, T Guha… - European Conference on …, 2022 - Springer

Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it
requires learning effective audiovisual features and spatial-temporal correlations over long …

Cited by 29 · Related articles · All 7 versions
