Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

M Wang, Y Wang, TT Vu, E Shareghi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent multimodal large language models (MLLMs) have made significant progress in
integrating information across various modalities, yet real-world applications in …

KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations

L Kou, F Ni, Y Zheng, J Liu, Y Yuan, Z Dong… - Forty-first International … - openreview.net
Robotic manipulation tasks often span long horizons and encapsulate multiple
subtasks requiring different skills. Learning policies directly from long-horizon demonstrations is …

TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection

R Tian, Q Dai, H Hu, Z Wu - openreview.net
Despite great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …