Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

AnyGPT: Unified multimodal LLM with discrete sequence modeling

J Zhan, J Dai, J Ye, Y Zhou, D Zhang, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete
representations for the unified processing of various modalities, including speech, text …

MuChoMusic: Evaluating music understanding in multimodal audio-language models

B Weck, I Manco, E Benetos, E Quinton… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal models that jointly process audio and language hold great promise in audio
understanding and are increasingly being adopted in the music domain. By allowing users …

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

E Labbé, T Pellegrini, J Pinquier - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Automated Audio Captioning (AAC) involves generating natural language descriptions of
audio content, using encoder-decoder architectures. An audio encoder produces audio …

Evaluation of pretrained language models on music understanding

Y Vasilakis, R Bittner, J Pauwels - arXiv preprint arXiv:2409.11449, 2024 - arxiv.org
Music-text multimodal systems have enabled new approaches to Music Information
Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based …

Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

I Manco, J Salamon, O Nieto - arXiv preprint arXiv:2409.11498, 2024 - arxiv.org
Audio-text contrastive models have become a powerful approach in music representation
learning. Despite their empirical success, however, little is known about the influence of key …

Talking to Your Recs: Multimodal Embeddings For Recommendation and Retrieval

S Oramas, A Ferraro, A Sarasua, F Gouyon - 2024 - ceur-ws.org
Large Language Models (LLMs) excel at understanding complex natural language
requests, and even providing recommendations, but they often rely on incomplete or …