Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

SongCreator: Lyrics-based universal song generation

S Lei, Y Zhou, B Tang, MWY Lam, F Liu, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

SMITIN: Self-monitored inference-time intervention for generative music transformers

J Koo, G Wichern, FG Germain… - IEEE Open Journal …, 2025 - ieeexplore.ieee.org
We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for
controlling an autoregressive generative music transformer using classifier probes. These …

Discogs-VI: A musical version identification dataset based on public editorial metadata

RO Araz, X Serra, D Bogdanov - arXiv preprint arXiv:2410.17400, 2024 - arxiv.org
Current version identification (VI) datasets often lack sufficient size and musical diversity to
train robust neural networks (NNs). Additionally, their non-representative clique size …

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

YB Lin, Y Tian, L Yang, G Bertasius, H Wang - arXiv preprint arXiv …, 2024 - arxiv.org
We present a framework for learning to generate background music from video inputs.
Unlike existing works that rely on symbolic musical annotations, which are limited in quantity …

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

S Wu, Y Wang, R Yuan, Z Guo, X Tan, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Challenges in managing linguistic diversity and integrating various musical modalities are
faced by current music information retrieval systems. These limitations reduce their …

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

B Wang, L Zhuo, Z Wang, C Bao, W Chengjing… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal music generation aims to produce music from diverse input modalities, including
text, videos, and images. Existing methods use a common embedding space for multimodal …

MusicScore: A Dataset for Music Score Modeling and Generation

Y Lin, Z Dai, Q Kong - arXiv preprint arXiv:2406.11462, 2024 - arxiv.org
Music scores are written representations of music and contain rich information about musical
components. The visual information on music scores includes notes, rests, staff lines, clefs …

SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

H Chen, JBL Smith, J Spijkervet, JC Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Progress in the task of symbolic music generation may be lagging behind other tasks like
audio and text generation, in part because of the scarcity of symbolic training data. In this …