WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Generative expressive conversational speech synthesis

R Liu, Y Hu, Y Ren, X Yin, H Li - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper
speaking style in a user-agent conversation setting. Existing CSS methods employ effective …

UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

Style-Talker: Finetuning audio language model and style-based text-to-speech model for fast spoken dialogue generation

YA Li, X Jiang, J Darefsky, G Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of large language models (LLMs) has significantly propelled the
development of text-based chatbots, demonstrating their capability to engage in coherent …

Re-evaluating the Command-and-Control Paradigm in Conversational Search Interactions

JR Trippas, L Gallagher, J Mackenzie - Proceedings of the 33rd ACM …, 2024 - dl.acm.org
Conversational assistants are becoming prevalent among the wider population due to their
simplicity and increasing utility. However, the shortcomings of these tools are as renowned …

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Q Chen, Y Chen, Y Chen, M Chen, Y Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in large language models (LLMs) and multimodal speech-text
models have laid the groundwork for seamless voice interactions, enabling real-time …

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

GT Lin, PG Shivakumar, A Gourav, Y Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
While textless Spoken Language Models (SLMs) have shown potential in end-to-end
speech-to-speech modeling, they still lag behind text-based Large Language Models …

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

W Kang, J Jia, C Wu, W Zhou, E Lakomkin… - arXiv preprint arXiv …, 2024 - arxiv.org
As speech becomes an increasingly common modality for interacting with large language
models (LLMs), it is becoming desirable to develop systems where LLMs can take into …

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Z Jia, R Liu - arXiv preprint arXiv:2412.18733, 2024 - arxiv.org
Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue
history (MDH) to generate speech with appropriate conversational prosody for target …

Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

CK Yang, YK Fu, CA Li, YC Lin, YX Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This technical report presents our initial attempt to build a spoken large language model
(LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech …