Muse: Multi-modal target speaker extraction with visual cues

Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …

Cited by 187 · Related articles · All 5 versions

Audio-visual cross-attention network for robotic speaker tracking

X Qian, Z Wang, J Wang, G Guan… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org

Audio-visual signals can be used jointly for robotic perception as they complement each
other. Such multi-modal sensory fusion has a clear advantage, especially under noisy …

Cited by 35 · Related articles · All 2 versions

NeuroHeed: Neuro-steered speaker extraction using EEG signals

Z Pan, M Borsdorf, S Cai, T Schultz… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org

Humans possess the remarkable ability to selectively attend to a single speaker amidst
competing voices and background noise, known as selective auditory attention. Recent …

Cited by 14 · Related articles · All 2 versions

Selective listening by synchronizing speech with lips

Z Pan, R Tao, C Xu, H Li - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org

A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-
talker speech mixture when given a cue that represents the target speaker, such as a pre …

Cited by 48 · Related articles · All 4 versions

USEV: Universal speaker extraction with visual cue

Z Pan, M Ge, H Li - IEEE/ACM Transactions on Audio, Speech …, 2022 - ieeexplore.ieee.org

A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-
talker speech mixture. The prior studies focus mostly on speaker extraction from a highly …

Cited by 49 · Related articles · All 4 versions

AV-SepFormer: Cross-attention SepFormer for audio-visual target speaker extraction

J Lin, X Cai, H Dinkel, J Chen, Z Yan… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Visual information can serve as an effective cue for target speaker extraction (TSE) and is
vital to improving extraction performance. In this paper, we propose AV-SepFormer, a …

Cited by 19 · Related articles · All 3 versions

Speaker extraction with co-speech gestures cue

Z Pan, X Qian, H Li - IEEE Signal Processing Letters, 2022 - ieeexplore.ieee.org

Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker
mixture speech. There have been studies to use a pre-recorded speech sample or face …

Cited by 28 · Related articles · All 3 versions

Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

J Wang, Z Pan, M Zhang, RT Tan, H Li - Proceedings of the AAAI …, 2024 - ojs.aaai.org

Prior studies on audio-visual speech recognition typically assume the visibility of speaking
lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely …

Cited by 8 · Related articles

VCSE: Time-domain visual-contextual speaker extraction network

J Li, M Ge, Z Pan, L Wang, J Dang - arXiv preprint arXiv:2210.06177, 2022 - arxiv.org

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an
auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip …

Cited by 12 · Related articles · All 5 versions

Rethinking the visual cues in audio-visual speaker extraction

J Li, M Ge, R Cao, L Wang, J Dang, S Zhang - arXiv preprint arXiv …, 2023 - arxiv.org

The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to
leverage two visual cues, namely speaker identity and synchronization, to enhance …

Cited by 11 · Related articles · All 5 versions
