SoundSpaces: Audio-visual navigation in 3D environments

C Chen, U Jain, C Schissler, SVA Gari… - Computer Vision–ECCV …, 2020 - Springer

Moving around in the world is naturally a multisensory experience, but today's embodied
agents are deaf—restricted to solely their visual perception of the environment. We introduce …

Cited by 290 · Related articles · All 6 versions

Speech2Face: Learning the face behind a voice

TH Oh, T Dekel, C Kim, I Mosseri… - Proceedings of the …, 2019 - openaccess.thecvf.com

How much can we infer about a person's looks from the way they speak? In this paper, we
study the task of reconstructing a facial image of a person from a short audio recording of …

Cited by 203 · Related articles · All 10 versions

VisualEchoes: Spatial Image Representation Learning Through Echolocation

R Gao, C Chen, Z Al-Halah, C Schissler… - Computer Vision–ECCV …, 2020 - Springer

Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans
have the remarkable ability to perform echolocation: a biological sonar used to perceive …

Cited by 100 · Related articles · All 11 versions

Audio-visual speaker diarization based on spatiotemporal Bayesian fusion

ID Gebru, S Ba, X Li, R Horaud - IEEE transactions on pattern …, 2017 - ieeexplore.ieee.org

Speaker diarization consists of assigning speech signals to people engaged in a dialogue.
An audio-visual spatiotemporal diarization model is proposed. The model is well suited for …

Cited by 124 · Related articles · All 12 versions

On learning associations of faces and voices

C Kim, HV Shin, TH Oh, A Kaspar, M Elgharib… - Computer Vision–ACCV …, 2019 - Springer

In this paper, we study the associations between human faces and voices. Audiovisual
integration, specifically the integration of facial and vocal information, is a well-researched …

Cited by 97 · Related articles · All 6 versions

Audio–visual particle flow SMC-PHD filtering for multi-speaker tracking

Y Liu, V Kılıç, J Guan, W Wang - IEEE Transactions on …, 2019 - ieeexplore.ieee.org

Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering is a popular
method used recently for audio-visual (AV) multi-speaker tracking. However, due to the …

Cited by 67 · Related articles · All 4 versions

Multi-speaker tracking from an audio–visual sensing device

X Qian, A Brutti, O Lanz, M Omologo… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org

Compact multi-sensor platforms are portable and thus desirable for robotics and
personal-assistance tasks. However, compared to physically distributed sensors, the size of these …

Cited by 62 · Related articles · All 11 versions

MAVL: Multiresolution analysis of voice localization

M Wang, W Sun, L Qiu - … Symposium on Networked Systems Design and …, 2021 - usenix.org

The ability for a smart speaker to localize a user based on his/her voice opens the door to
many new applications. In this paper, we present a novel system, MAVL, to localize human …

Cited by 36 · Related articles · All 6 versions

RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis

C Beyan, M Shahid, V Murino - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

We present an automatic voice activity detection (VAD) method that is solely based on visual
cues. Unlike traditional approaches processing audio, we show that upper body motion …

Cited by 27 · Related articles · All 4 versions

Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers

K Hoover, S Chaudhuri, C Pantofaru, M Slaney… - arXiv preprint arXiv …, 2017 - arxiv.org

In this paper, we present a system that associates faces with voices in a video by fusing
information from the audio and visual signals. The thesis underlying our work is that an …

Cited by 39 · Related articles · All 3 versions
