Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark …
We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …
R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds …
S Mo, P Morgado - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding …
F Qingyun, H Dapeng, W Zhaokui - arXiv preprint arXiv:2111.00273, 2021 - arxiv.org
Multispectral image pairs can provide the combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different …
S Mo, Y Tian - Proceedings of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual …
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, it is labor-intensive to temporally annotate audio and visual …
Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have …
The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …