Active speaker detection with audio-visual co-training

A systematic literature review on multimodal machine learning: Applications, challenges, gaps and future directions

A Barua, MU Ahmed, S Begum - IEEE Access, 2023 - ieeexplore.ieee.org

Multimodal machine learning (MML) is a tempting multidisciplinary research area where
heterogeneous data from multiple modalities and machine learning (ML) are combined to …

Cited by 52 · Related articles · All 4 versions

[PDF] arxiv.org

AVA active speaker: An audio-visual dataset for active speaker detection

J Roth, S Chaudhuri, O Klejch, R Marvin… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org

Active speaker detection is an important component in video analysis algorithms for
applications such as speaker diarization, video re-targeting for meetings, speech …

Cited by 190 · Related articles · All 6 versions

[PDF] thecvf.com

Active speakers in context

JL Alcázar, F Caba, L Mai, F Perazzi… - Proceedings of the …, 2020 - openaccess.thecvf.com

Current methods for active speaker detection focus on modeling audiovisual information
from a single speaker. This strategy can be adequate for addressing single-speaker …

Cited by 98 · Related articles · All 8 versions

[PDF] thecvf.com

MAAS: Multi-modal assignation for active speaker detection

JL Alcázar, F Caba, AK Thabet… - Proceedings of the …, 2021 - openaccess.thecvf.com

Active speaker detection requires a solid integration of multi-modal cues. While individual
modalities can approximate a solution, accurate predictions can only be achieved by …

Cited by 60 · Related articles · All 8 versions

RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis

C Beyan, M Shahid, V Murino - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

We present an automatic voice activity detection (VAD) method that is solely based on visual
cues. Unlike traditional approaches processing audio, we show that upper body motion …

Cited by 29 · Related articles · All 4 versions

Leveraging Visual Supervision for Array-Based Active Speaker Detection and Localization

D Berghi, PJB Jackson - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org

Conventional audio-visual approaches for active speaker detection (ASD) typically rely on
visually pre-extracted face tracks and the corresponding single-channel audio to find the …

Cited by 5 · Related articles · All 5 versions

[PDF] mdpi.com

Prediction of who will be next speaker and when using mouth-opening pattern in multi-party conversation

R Ishii, K Otsuka, S Kumano, R Higashinaka… - Multimodal …, 2019 - mdpi.com

We investigated the mouth-opening transition pattern (MOTP), which represents the change
of mouth-opening degree during the end of an utterance, and used it to predict the next …

Cited by 20 · Related articles · All 4 versions

[PDF] arxiv.org

End-to-end lip synchronisation based on pattern classification

YJ Kim, HS Heo, SW Chung… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org

The goal of this work is to synchronise audio and video of a talking face using deep neural
network models. Existing works have trained networks on proxy tasks such as cross-modal …

Cited by 17 · Related articles · All 3 versions

[PDF] thecvf.com

Voice activity detection by upper body motion analysis and unsupervised domain adaptation

M Shahid, C Beyan, V Murino - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

We present a novel vision-based voice activity detection (VAD) method that relies only on
automatic upper body motion (UBM) analysis. Traditionally, VAD is performed using audio …

Cited by 19 · Related articles · All 7 versions

[PDF] arxiv.org

Audio-video fusion strategies for active speaker detection in meetings

L Pibre, F Madrigal, C Equoy, F Lerasle… - Multimedia Tools and …, 2023 - Springer

Meetings are a common activity in professional contexts, and it remains challenging to
endow vocal assistants with advanced functionalities to facilitate meeting management. In …

Cited by 4 · Related articles · All 9 versions