Recent Advances in Speech Language Models: A Survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Mini-Omni2: Towards Open-Source GPT-4o Model with Vision, Speech and Duplex

Z Xie, C Wu - arXiv preprint arXiv:2410.11190, 2024 - arxiv.org
GPT4o, an all-encompassing model, represents a milestone in the development of multi-
modal large models. It can understand visual, auditory, and textual modalities, directly output …

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

GT Lin, PG Shivakumar, A Gourav, Y Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
While textless Spoken Language Models (SLMs) have shown potential in end-to-end
speech-to-speech modeling, they still lag behind text-based Large Language Models …

VoiceBench: Benchmarking LLM-Based Voice Assistants

Y Chen, X Yue, C Zhang, X Gao, RT Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the success of large language models (LLMs), recent advancements such as
GPT-4o have enabled real-time speech interactions through LLM-based voice assistants …

Enabling Real-Time Conversations with Minimal Training Costs

W Xu, S Wang, W Zhao, X Han, Y Yan, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated the ability to improve human efficiency
through conversational interactions. Conventional LLM-powered dialogue systems …

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

A Dao, DB Vu, HH Ha - arXiv preprint arXiv:2410.15316, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized natural language processing, but their
application to speech-based tasks remains challenging due to the complexities of …

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

X Wang, Y Li, C Fu, L Xie, K Li, X Sun, L Ma - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models has brought many new smart applications,
especially the excellent multimodal human-computer interaction in GPT-4o has brought …

Continuous or Discrete, That Is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension

Z Li, J Zhang, D Wang, Y Wang, X Huang, Z Wei - 2024 - preprints.org
With the success of large language models (LLMs) driving progress towards general-
purpose AI, there has been a growing focus on extending these models to multi-modal …