SoundSpaces: Audio-visual navigation in 3D environments

C Chen, U Jain, C Schissler, SVA Gari… - Computer Vision–ECCV …, 2020 - Springer
Moving around in the world is naturally a multisensory experience, but today's embodied
agents are deaf—restricted to solely their visual perception of the environment. We introduce …

Speech2face: Learning the face behind a voice

TH Oh, T Dekel, C Kim, I Mosseri… - Proceedings of the …, 2019 - openaccess.thecvf.com
How much can we infer about a person's looks from the way they speak? In this paper, we
study the task of reconstructing a facial image of a person from a short audio recording of …

VisualEchoes: Spatial Image Representation Learning Through Echolocation

R Gao, C Chen, Z Al-Halah, C Schissler… - Computer Vision–ECCV …, 2020 - Springer
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans
have the remarkable ability to perform echolocation: a biological sonar used to perceive …

Audio-visual speaker diarization based on spatiotemporal Bayesian fusion

ID Gebru, S Ba, X Li, R Horaud - IEEE transactions on pattern …, 2017 - ieeexplore.ieee.org
Speaker diarization consists of assigning speech signals to people engaged in a dialogue.
An audio-visual spatiotemporal diarization model is proposed. The model is well suited for …

On learning associations of faces and voices

C Kim, HV Shin, TH Oh, A Kaspar, M Elgharib… - Computer Vision–ACCV …, 2019 - Springer
In this paper, we study the associations between human faces and voices. Audiovisual
integration, specifically the integration of facial and vocal information, is a well-researched …

Audio–visual particle flow SMC-PHD filtering for multi-speaker tracking

Y Liu, V Kılıç, J Guan, W Wang - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering is a popular
method used recently for audio-visual (AV) multi-speaker tracking. However, due to the …

Multi-speaker tracking from an audio–visual sensing device

X Qian, A Brutti, O Lanz, M Omologo… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-
assistance tasks. However, compared to physically distributed sensors, the size of these …

MAVL: Multiresolution analysis of voice localization

M Wang, W Sun, L Qiu - … Symposium on Networked Systems Design and …, 2021 - usenix.org
The ability for a smart speaker to localize a user based on his/her voice opens the door to
many new applications. In this paper, we present a novel system, MAVL, to localize human …

RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis

C Beyan, M Shahid, V Murino - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
We present an automatic voice activity detection (VAD) method that is solely based on visual
cues. Unlike traditional approaches processing audio, we show that upper body motion …

Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers

K Hoover, S Chaudhuri, C Pantofaru, M Slaney… - arXiv preprint arXiv …, 2017 - arxiv.org
In this paper, we present a system that associates faces with voices in a video by fusing
information from the audio and visual signals. The thesis underlying our work is that an …