Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

M Wang, Y Wang, TT Vu, E Shareghi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent multimodal large language models (MLLMs) have made significant progress in
integrating information across various modalities, yet real-world applications in …

KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations

L Kou, F Ni, Y Zheng, J Liu, Y Yuan, Z Dong… - Forty-first International … - openreview.net
Robotic manipulation tasks often span long horizons and encapsulate multiple
subtasks requiring different skills. Learning policies directly from long-horizon demonstrations is …

TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection

R Tian, Q Dai, H Hu, Z Wu - openreview.net
Despite great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …