- Academic resource search

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 65

A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 211, 6 versions

AudioLM: a language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM transactions on …, 2023 - ieeexplore.ieee.org

We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

Cited by 557, 5 versions

Video probabilistic diffusion models in projected latent space

S Yu, K Sohn, S Kim, J Shin - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Despite the remarkable progress in deep generative models, synthesizing high-resolution
and temporally coherent videos still remains a challenge due to their high-dimensionality …

Cited by 164, 6 versions

Megabyte: Predicting million-byte sequences with multiscale transformers

L Yu, D Simig, C Flaherty… - Advances in …, 2023 - proceedings.neurips.cc

Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We proposed …

Cited by 74, 5 versions

SoundStream: An end-to-end neural audio codec

N Zeghidour, A Luebs, A Omran… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

We present SoundStream, a novel neural audio codec that can efficiently compress speech,
music and general audio at bitrates normally targeted by speech-tailored codecs …

Cited by 678, 5 versions

Grad-TTS: A diffusion probabilistic model for text-to-speech

V Popov, I Vovk, V Gogoryan… - International …, 2021 - proceedings.mlr.press

Recently, denoising diffusion probabilistic models and generative score matching have
shown high potential in modelling complex data distributions while stochastic calculus has …

Cited by 550, 5 versions

HumanNorm: Learning normal diffusion model for high-quality and realistic 3D human generation

X Huang, R Shao, Q Zhang, H Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recent text-to-3D methods employing diffusion models have made significant
advancements in 3D human generation. However, these approaches face challenges due to …

Cited by 56, 4 versions

Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models

S Bond-Taylor, A Leach, Y Long… - IEEE transactions on …, 2021 - ieeexplore.ieee.org

Deep generative models are a class of techniques that train deep neural networks to model
the distribution of training samples. Research has fragmented into various interconnected …

Cited by 627, 12 versions

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Cited by 455, 2 versions
