VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Z Tian, Z Liu, R Yuan, J Pan, X Huang, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we systematically study music generation conditioned solely on the video. First,
we present a large-scale dataset comprising 190K video-music pairs, including various …

From Efficient Multimodal Models to World Models: A Survey

X Mai, Z Tao, J Lin, H Wang, Y Chang, Y Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Models (MLMs) are becoming a significant research focus, combining
powerful large language models with multimodal learning to perform complex tasks across …