Z Wang, Q Duan, YW Tai, CK Tang - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models),
a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio …