Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Automatic design of machine learning via evolutionary computation: A survey

N Li, L Ma, T Xing, G Yu, C Wang, Y Wen, S Cheng… - Applied Soft …, 2023 - Elsevier
Machine learning (ML), as the most promising paradigm to discover deep
knowledge from data, has been widely applied to practical applications, such as …

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

CLAP: Learning audio concepts from natural language supervision

B Elizalde, S Deshmukh, M Al Ismail… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Mainstream machine listening models are trained to learn audio concepts under the
paradigm of one class label to many recordings focusing on one task. Learning under such …

On the use of AI-based tools like ChatGPT to support management research

B Burger, DK Kanbach, S Kraus, M Breier… - European Journal of …, 2023 - emerald.com
Purpose: The article discusses the current relevance of artificial intelligence (AI) in research
and how AI improves various research methods. This article focuses on the practical case …

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Masked autoencoders that listen

PY Huang, H Xu, J Li, A Baevski… - Advances in …, 2022 - proceedings.neurips.cc
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to
self-supervised representation learning from audio spectrograms. Following the Transformer …

Dawn of the transformer era in speech emotion recognition: closing the valence gap

J Wagner, A Triantafyllopoulos… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Recent advances in transformer-based architectures have shown promise in several
machine learning tasks. In the audio domain, such architectures have been successfully …

AST: Audio spectrogram transformer

Y Gong, YA Chung, J Glass - arXiv preprint arXiv:2104.01778, 2021 - arxiv.org
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the
main building block for end-to-end audio classification models, which aim to learn a direct …

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …