Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks

S Maiti, Y Peng, S Choi, J Jung… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech
recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates …

Diverse and aligned audio-to-video generation via text-to-video model adaptation

G Yariv, I Gat, S Benaim, L Wolf, I Schwartz… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We consider the task of generating diverse and realistic videos guided by natural audio
samples from a wide variety of semantic classes. For this task, the videos are required to be …

Speechgen: Unlocking the generative power of speech language models with prompts

H Wu, KW Chang, YK Wu, H Lee - arXiv preprint arXiv:2306.02207, 2023 - arxiv.org
Large language models (LLMs) have gained considerable attention for Artificial Intelligence
Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct …

Towards General-Purpose Text-Instruction-Guided Voice Conversion

CY Kuan, CA Li, TY Hsu, TY Lin… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
This paper introduces a novel voice conversion (VC) model, guided by text instructions such
as “articulate slowly with a deep tone” or “speak in a cheerful boyish voice”. Unlike …

An exploration of in-context learning for speech language model

MH Hsu, KW Chang, SW Li, H Lee - arXiv preprint arXiv:2310.12477, 2023 - arxiv.org
Ever since the development of GPT-3 in the natural language processing (NLP) field,
in-context learning (ICL) has played an important role in utilizing large language models …

Spirit-lm: Interleaved spoken and written language model

TA Nguyen, B Muller, B Yu, MR Costa-Jussa… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and
speech. Our model is based on a pretrained text language model that we extend to the …

Speechprompt: Prompting speech language models for speech processing tasks

KW Chang, H Wu, YK Wang, YK Wu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …

Prompting and adapter tuning for self-supervised encoder-decoder speech model

KW Chang, MH Chen, YP Lin, JN Hsu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Prompting and adapter tuning have emerged as efficient alternatives to fine-tuning (FT)
methods. However, existing studies on speech prompting focused on classification tasks …