Audio self-supervised learning: A survey

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Wav2CLIP: Learning robust audio representations from CLIP

HH Wu, P Seetharaman, K Kumar… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We propose Wav2CLIP, a robust audio representation learning method by distilling from
Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

A closer look at weakly-supervised audio-visual source localization

S Mo, P Morgado - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Audio-visual source localization is a challenging task that aims to predict the location of
visual sound sources in a video. Since collecting ground-truth annotations of sounding …

Cross-modality fusion transformer for multispectral object detection

F Qingyun, H Dapeng, W Zhaokui - arXiv preprint arXiv:2111.00273, 2021 - arxiv.org
Multispectral image pairs can provide the combined information, making object detection
applications more reliable and robust in the open world. To fully exploit the different …

Audio-visual grouping network for sound localization from mixtures

S Mo, Y Tian - Proceedings of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Sound source localization is a typical and challenging task that predicts the location of
sound sources in a video. Previous single-source methods mainly used the audio-visual …

Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing

YB Lin, HY Tseng, HY Lee, YY Lin… - Advances in Neural …, 2021 - proceedings.neurips.cc
The audio-visual video parsing task aims to temporally parse a video into audio or visual
event categories. However, it is labor intensive to temporally annotate audio and visual …

Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Conditional generation of audio from video via Foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …