A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

Z Song, Z Ma, Y Yang, J Zhuo, X Chen - arXiv preprint arXiv:2412.00721, 2024 - arxiv.org
Large Language Models (LLMs) have showcased exceptional performance across diverse
NLP tasks, and their integration with speech encoders is rapidly emerging as a dominant …

HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models

B Mu, K Wei, Q Shao, Y Xu, L Xie - arXiv preprint arXiv:2409.19878, 2024 - arxiv.org
Recent advancements in integrating Large Language Models (LLMs) with automatic speech
recognition (ASR) have performed remarkably in general domains. While supervised fine …

Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

F Zhang, W Geng, H Huang, Y Shan, C Yi… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated
with a Mixture of Experts (MoE) based connector to address the challenge of Code …

Advancing Multi-Talker ASR Performance With Large Language Models

M Shi, Z Jin, Y Xu, Y Xu, SX Zhang… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Recognizing overlapping speech from multiple speakers in conversational scenarios is one
of the most challenging problems for automatic speech recognition (ASR). Serialized output …

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Y Weng, H Wang, T Gao, K Li, S Niu, J Du - arXiv preprint arXiv …, 2024 - arxiv.org
In multimodal sentiment analysis, collecting text data is often more challenging than video or
audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) …

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

Z Ma, Z Chen, Y Wang, ES Chng, X Chen - arXiv e-prints, 2025 - ui.adsabs.harvard.edu
Large Audio-Language Models (LALMs) have demonstrated remarkable
performance in tasks involving audio perception and understanding, such as speech …