Audioldm: Text-to-audio generation with latent diffusion models

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

被引用次数：102 相关文章所有 6 个版本

[PDF] arxiv.org

Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

被引用次数：54 相关文章所有 2 个版本

[PDF] neurips.cc

Simple and controllable music generation

J Copet, F Kreuk, I Gat, T Remez… - Advances in …, 2024 - proceedings.neurips.cc

We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …

被引用次数：210 相关文章所有 9 个版本

[PDF] arxiv.org

Next-gpt: Any-to-any multimodal llm

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

被引用次数：224 相关文章所有 4 个版本

[PDF] thecvf.com

Your diffusion model is secretly a zero-shot classifier

AC Li, M Prabhudesai, S Duggal… - Proceedings of the …, 2023 - openaccess.thecvf.com

The recent wave of large-scale text-to-image diffusion models has dramatically increased
our text-based image generation abilities. These models can generate realistic images for a …

被引用次数：110 相关文章所有 9 个版本

[PDF] neurips.cc

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2024 - proceedings.neurips.cc

Abstract We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

被引用次数：86 相关文章所有 8 个版本

[PDF] thecvf.com

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …

被引用次数：40 相关文章所有 3 个版本

[PDF] arxiv.org

Text-to-audio generation using instruction-tuned llm and latent diffusion model

D Ghosal, N Majumder, A Mehrish, S Poria - arXiv preprint arXiv …, 2023 - arxiv.org

The immense scale of the recent large language models (LLM) allows many interesting
properties, such as, instruction-and chain-of-thought-based fine-tuning, that has significantly …

被引用次数：93 相关文章所有 3 个版本

[PDF] arxiv.org

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

被引用次数：78 相关文章所有 3 个版本

[PDF] arxiv.org

Audioldm 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

被引用次数：52 相关文章所有 5 个版本

高级搜索

QQ 群