Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video …
Vision-Language Large Models (VLMs) have become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e., throughput and …
J Cao, P Ye, S Li, C Yu, Y Tang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract: Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs, where a major reason can be …
Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging …
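As a concrete illustration of the pruning variant mentioned in this snippet, below is a minimal PyTorch sketch of attention-based token pruning: patch tokens that receive the least [CLS] attention are dropped. This is a generic sketch, not the method of any specific paper above; the function name `prune_tokens` and the `keep_ratio` parameter are illustrative choices.

```python
import torch

def prune_tokens(tokens: torch.Tensor,
                 cls_attn: torch.Tensor,
                 keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the most-attended patch tokens, scored by the attention
    the [CLS] token pays to each patch (averaged over heads).

    tokens:   (B, 1 + N, D) -- [CLS] token followed by N patch tokens
    cls_attn: (B, N)        -- [CLS]->patch attention scores
    """
    B, n_plus_1, D = tokens.shape
    n_keep = max(1, int((n_plus_1 - 1) * keep_ratio))

    # Indices of the top-k most-attended patch tokens.
    topk = cls_attn.topk(n_keep, dim=1).indices           # (B, n_keep)
    idx = topk.unsqueeze(-1).expand(-1, -1, D)            # (B, n_keep, D)

    cls_tok = tokens[:, :1]                               # (B, 1, D)
    patches = tokens[:, 1:].gather(1, idx)                # (B, n_keep, D)
    return torch.cat([cls_tok, patches], dim=1)           # (B, 1 + n_keep, D)

# Toy usage: halve 196 patch tokens down to 98, keeping [CLS].
x = torch.randn(2, 197, 768)
attn = torch.rand(2, 196)
print(prune_tokens(x, attn, keep_ratio=0.5).shape)  # torch.Size([2, 99, 768])
```

Merging variants instead fuse the discarded tokens into the kept ones (e.g., by weighted averaging) so that no information is thrown away outright.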
J Zhuang, L Lu, M Dai, R Hu, J Chen, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of …
With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated …
G Yu, Y Chen, J Xu - arXiv preprint arXiv:2409.01162, 2024 - arxiv.org
Recently, multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their wider adoption and …
We study the capability of Video-Language (VidL) models to understand compositions of objects, attributes, actions, and their relations. Composition understanding …
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache …
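To make the memory pressure from KV caching concrete, here is a minimal sketch of a fixed-budget cache that evicts the oldest entries beyond a sliding window. This is an assumed, simplified setup, not the compression scheme of the snippet above; the class name `SlidingWindowKVCache` and the `budget` parameter are illustrative.

```python
import torch

class SlidingWindowKVCache:
    """Fixed-budget per-layer KV cache: once the cached token count
    exceeds `budget`, the oldest keys/values are evicted. Real systems
    often combine such recency windows with importance scores."""

    def __init__(self, budget: int):
        self.budget = budget
        self.k = None  # (B, H, T, D)
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Concatenate the new decoding step(s) along the time axis.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        # Evict the oldest entries beyond the budget.
        if self.k.size(2) > self.budget:
            self.k = self.k[:, :, -self.budget:]
            self.v = self.v[:, :, -self.budget:]
        return self.k, self.v

# Toy usage: after 6 decode steps the cache stays bounded at 4 tokens.
cache = SlidingWindowKVCache(budget=4)
for _ in range(6):
    k, v = cache.append(torch.randn(1, 2, 1, 8), torch.randn(1, 2, 1, 8))
print(k.shape)  # torch.Size([1, 2, 4, 8])
```

Without such a bound, the cache grows linearly with sequence length (B x H x T x D entries per layer), which is exactly the memory demand the snippet describes.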