LLark: A Multimodal Instruction-Following Language Model for Music

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Cited by 6 - Related articles - All 2 versions

AnyGPT: Unified multimodal LLM with discrete sequence modeling

J Zhan, J Dai, J Ye, Y Zhou, D Zhang, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete
representations for the unified processing of various modalities, including speech, text …

Cited by 54 - Related articles - All 2 versions

MuChoMusic: Evaluating music understanding in multimodal audio-language models

B Weck, I Manco, E Benetos, E Quinton… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal models that jointly process audio and language hold great promise in audio
understanding and are increasingly being adopted in the music domain. By allowing users …

Cited by 7 - Related articles - All 3 versions

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

E Labbé, T Pellegrini, J Pinquier - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org

Automated Audio Captioning (AAC) involves generating natural language descriptions of
audio content, using encoder-decoder architectures. An audio encoder produces audio …

Cited by 10 - Related articles - All 16 versions

Evaluation of pretrained language models on music understanding

Y Vasilakis, R Bittner, J Pauwels - arXiv preprint arXiv:2409.11449, 2024 - arxiv.org

Music-text multimodal systems have enabled new approaches to Music Information
Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based …

Cited by 1 - Related articles - All 3 versions

Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

I Manco, J Salamon, O Nieto - arXiv preprint arXiv:2409.11498, 2024 - arxiv.org

Audio-text contrastive models have become a powerful approach in music representation
learning. Despite their empirical success, however, little is known about the influence of key …

Talking to Your Recs: Multimodal Embeddings For Recommendation and Retrieval

S Oramas, A Ferraro, A Sarasua, F Gouyon - 2024 - ceur-ws.org

Large Language Models (LLMs) excel at understanding complex natural language
requests, and even providing recommendations, but they often rely on incomplete or …
