Perceiver: General perception with iterative attention

A Jaegle, F Gimeno, A Brock… - International …, 2021 - proceedings.mlr.press
Biological systems understand the world by simultaneously processing high-dimensional
inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The …

Panns: Large-scale pretrained audio neural networks for audio pattern recognition

Q Kong, Y Cao, T Iqbal, Y Wang… - … on Audio, Speech …, 2020 - ieeexplore.ieee.org
Audio pattern recognition is an important research topic in the machine learning area, and
includes several tasks such as audio tagging, acoustic scene classification, music …

Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation

Y Gong, YA Chung, J Glass - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
Audio tagging is an active research area and has a wide range of applications. Since the
release of AudioSet, great progress has been made in advancing model performance, which …

Unified multisensory perception: Weakly-supervised audio-visual video parsing

Y Tian, D Li, C Xu - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
In this paper, we introduce a new problem, named audio-visual video parsing, which aims to
parse a video into temporal event segments and label them as either audible, visible, or …

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling

Y Wang, J Li, F Metze - ICASSP 2019-2019 IEEE International …, 2019 - ieeexplore.ieee.org
Sound event detection (SED) entails two subtasks: recognizing what types of sound events
are present in an audio stream (audio tagging), and pinpointing their onset and offset times …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

Audio retrieval with natural language queries

AM Oncescu, A Koepke, JF Henriques, Z Akata… - arXiv preprint arXiv …, 2021 - arxiv.org
We consider the task of retrieving audio using free-form natural language queries. To study
this problem, which has received limited attention in the existing literature, we introduce …