Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …

Audio-visual cross-attention network for robotic speaker tracking

X Qian, Z Wang, J Wang, G Guan… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org
Audio-visual signals can be used jointly for robotic perception as they complement each
other. Such multi-modal sensory fusion has a clear advantage, especially under noisy …

NeuroHeed: Neuro-steered speaker extraction using EEG signals

Z Pan, M Borsdorf, S Cai, T Schultz… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Humans possess the remarkable ability to selectively attend to a single speaker amidst
competing voices and background noise, known as selective auditory attention. Recent …

Selective listening by synchronizing speech with lips

Z Pan, R Tao, C Xu, H Li - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org
A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-
talker speech mixture when given a cue that represents the target speaker, such as a pre …

USEV: Universal speaker extraction with visual cue

Z Pan, M Ge, H Li - IEEE/ACM Transactions on Audio, Speech …, 2022 - ieeexplore.ieee.org
A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-
talker speech mixture. The prior studies focus mostly on speaker extraction from a highly …

Av-sepformer: Cross-attention sepformer for audio-visual target speaker extraction

J Lin, X Cai, H Dinkel, J Chen, Z Yan… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Visual information can serve as an effective cue for target speaker extraction (TSE) and is
vital to improving extraction performance. In this paper, we propose AV-SepFormer, a …

Speaker extraction with co-speech gestures cue

Z Pan, X Qian, H Li - IEEE Signal Processing Letters, 2022 - ieeexplore.ieee.org
Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker
mixture speech. There have been studies to use a pre-recorded speech sample or face …

Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

J Wang, Z Pan, M Zhang, RT Tan, H Li - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Prior studies on audio-visual speech recognition typically assume the visibility of speaking
lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely …

VCSE: Time-domain visual-contextual speaker extraction network

J Li, M Ge, Z Pan, L Wang, J Dang - arXiv preprint arXiv:2210.06177, 2022 - arxiv.org
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an
auxiliary reference. Such reference can be auditory, ie, a pre-recorded speech, visual, ie, lip …

Rethinking the visual cues in audio-visual speaker extraction

J Li, M Ge, R Cao, L Wang, J Dang, S Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to
leverage two visual cues, namely speaker identity and synchronization, to enhance …