A systematic literature review on multimodal machine learning: Applications, challenges, gaps and future directions

A Barua, MU Ahmed, S Begum - IEEE Access, 2023 - ieeexplore.ieee.org
Multimodal machine learning (MML) is a tempting multidisciplinary research area where
heterogeneous data from multiple modalities and machine learning (ML) are combined to …

AVA Active Speaker: An audio-visual dataset for active speaker detection

J Roth, S Chaudhuri, O Klejch, R Marvin… - ICASSP 2020 …, 2020 - ieeexplore.ieee.org
Active speaker detection is an important component in video analysis algorithms for
applications such as speaker diarization, video re-targeting for meetings, speech …

Active speakers in context

JL Alcázar, F Caba, L Mai, F Perazzi… - Proceedings of the …, 2020 - openaccess.thecvf.com
Current methods for active speaker detection focus on modeling audiovisual information
from a single speaker. This strategy can be adequate for addressing single-speaker …

MAAS: Multi-modal assignation for active speaker detection

JL Alcázar, F Caba, AK Thabet… - Proceedings of the …, 2021 - openaccess.thecvf.com
Active speaker detection requires a solid integration of multi-modal cues. While individual
modalities can approximate a solution, accurate predictions can only be achieved by …

RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis

C Beyan, M Shahid, V Murino - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
We present an automatic voice activity detection (VAD) method that is solely based on visual
cues. Unlike traditional approaches processing audio, we show that upper body motion …

Leveraging Visual Supervision for Array-Based Active Speaker Detection and Localization

D Berghi, PJB Jackson - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Conventional audio-visual approaches for active speaker detection (ASD) typically rely on
visually pre-extracted face tracks and the corresponding single-channel audio to find the …

Prediction of who will be next speaker and when using mouth-opening pattern in multi-party conversation

R Ishii, K Otsuka, S Kumano, R Higashinaka… - Multimodal …, 2019 - mdpi.com
We investigated the mouth-opening transition pattern (MOTP), which represents the change
of mouth-opening degree during the end of an utterance, and used it to predict the next …

End-to-end lip synchronisation based on pattern classification

YJ Kim, HS Heo, SW Chung… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
The goal of this work is to synchronise audio and video of a talking face using deep neural
network models. Existing works have trained networks on proxy tasks such as cross-modal …

Voice activity detection by upper body motion analysis and unsupervised domain adaptation

M Shahid, C Beyan, V Murino - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We present a novel vision-based voice activity detection (VAD) method that relies only on
automatic upper body motion (UBM) analysis. Traditionally, VAD is performed using audio …

Audio-video fusion strategies for active speaker detection in meetings

L Pibre, F Madrigal, C Equoy, F Lerasle… - Multimedia Tools and …, 2023 - Springer
Meetings are a common activity in professional contexts, and it remains challenging to
endow vocal assistants with advanced functionalities to facilitate meeting management. In …