Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

CLAP: Learning audio concepts from natural language supervision

B Elizalde, S Deshmukh, M Al Ismail… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Mainstream machine listening models are trained to learn audio concepts under the
paradigm of one class label to many recordings focusing on one task. Learning under such …

AST: Audio spectrogram transformer

Y Gong, YA Chung, J Glass - arXiv preprint arXiv:2104.01778, 2021 - arxiv.org
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the
main building block for end-to-end audio classification models, which aim to learn a direct …

Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

A comprehensive review of polyphonic sound event detection

TK Chan, CS Chin - IEEE Access, 2020 - ieeexplore.ieee.org
One of the most amazing functions of the human auditory system is the ability to detect all
kinds of sound events in the environment. With the technologies and hardware advances …

Latent variable sequential set transformers for joint multi-agent motion prediction

R Girgis, F Golemo, F Codevilla, M Weiss… - arXiv preprint arXiv …, 2021 - arxiv.org
Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A
major challenge is to efficiently learn a representation that approximates the true joint …

Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation

I Martín-Morató, A Mesaros - IEEE/ACM transactions on audio …, 2023 - ieeexplore.ieee.org
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the
specific format of the strong labels necessary for sound event detection is not easily …

Conditional sound generation using neural discrete time-frequency representation learning

X Liu, T Iqbal, J Zhao, Q Huang… - 2021 IEEE 31st …, 2021 - ieeexplore.ieee.org
Deep generative models have recently achieved impressive performance in speech and
music synthesis. However, compared to the generation of those domain-specific sounds …

A transformer-based audio captioning model with keyword estimation

Y Koizumi, R Masumura, K Nishida, M Yasuda… - arXiv preprint arXiv …, 2020 - arxiv.org
One of the problems with automated audio captioning (AAC) is the indeterminacy in word
selection corresponding to the audio event/scene. Since one acoustic event/scene can be …

CMKD: CNN/Transformer-based cross-model knowledge distillation for audio classification

Y Gong, S Khurana, A Rouditchenko… - arXiv preprint arXiv …, 2022 - arxiv.org
Audio classification is an active research area with a wide range of applications. Over the
past decade, convolutional neural networks (CNNs) have been the de-facto standard …