Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied …
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation …
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer …
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark …
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator …
Recommending purely cold-start items is a long-standing and fundamental challenge in recommender systems. Without any historical interaction on cold-start items, the …
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2 M tracks from YouTube videos and …
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to …
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music …