Into the wild with audioscope: Unsupervised audio-visual separation of on-screen sounds

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com

Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Cited by 119 · Related articles · All 12 versions

[PDF] neurips.cc

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc

Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Cited by 617 · Related articles · All 8 versions

[PDF] arxiv.org

Wav2clip: Learning robust audio representations from clip

HH Wu, P Seetharaman, K Kumar… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org

We propose Wav2CLIP, a robust audio representation learning method by distilling from
Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …

Cited by 266 · Related articles · All 9 versions

[PDF] arxiv.org

Visualvoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org

We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

Cited by 192 · Related articles · All 9 versions

[PDF] neurips.cc

A closer look at weakly-supervised audio-visual source localization

S Mo, P Morgado - Advances in Neural Information …, 2022 - proceedings.neurips.cc

Audio-visual source localization is a challenging task that aims to predict the location of
visual sound sources in a video. Since collecting ground-truth annotations of sounding …

Cited by 59 · Related articles · All 6 versions

[PDF] arxiv.org

Cross-modality fusion transformer for multispectral object detection

F Qingyun, H Dapeng, W Zhaokui - arXiv preprint arXiv:2111.00273, 2021 - arxiv.org

Multispectral image pairs can provide the combined information, making object detection
applications more reliable and robust in the open world. To fully exploit the different …

Cited by 153 · Related articles · All 4 versions

[PDF] thecvf.com

Audio-visual grouping network for sound localization from mixtures

S Mo, Y Tian - Proceedings of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

Sound source localization is a typical and challenging task that predicts the location of
sound sources in a video. Previous single-source methods mainly used the audio-visual …

Cited by 43 · Related articles · All 5 versions

[PDF] neurips.cc

Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing

YB Lin, HY Tseng, HY Lee, YY Lin… - Advances in Neural …, 2021 - proceedings.neurips.cc

The audio-visual video parsing task aims to temporally parse a video into audio or visual
event categories. However, it is labor intensive to temporally annotate audio and visual …

Cited by 77 · Related articles · All 11 versions

[PDF] thecvf.com

Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com

Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Cited by 15 · Related articles · All 8 versions

[PDF] thecvf.com

Conditional generation of audio from video via foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com

The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …

Cited by 38 · Related articles · All 7 versions
