MoPE-CLIP: Structured pruning for efficient vision-language models with module-wise pruning error metric

H Lin, H Bai, Z Liu, L Hou, M Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language pre-trained models have achieved impressive performance on various
downstream tasks. However, their large model sizes hinder their utilization on platforms with …

TESTA: Temporal-spatial token aggregation for long-form video-language understanding

S Ren, S Chen, S Li, X Sun, L Hou - arXiv preprint arXiv:2310.19060, 2023 - arxiv.org
Large-scale video-language pre-training has made remarkable strides in advancing video-
language understanding tasks. However, the heavy computational burden of video …

Turbo: Informativity-driven acceleration plug-in for vision-language models

C Ju, H Wang, Z Li, X Chen, Z Zhai, W Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-Language Large Models (VLMs) have become the primary backbone of AI due to their
impressive performance. However, their expensive computation costs, i.e., throughput and …

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

J Cao, P Ye, S Li, C Yu, Y Tang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Transformers (VLTs) have shown great success recently but are
accompanied by heavy computation costs, where a major reason can be …

Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning

S Jie, Y Tang, J Guo, ZH Deng, K Han… - European Conference on …, 2024 - Springer
Token compression expedites the training and inference of Vision Transformers (ViTs) by
reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging …

ST³: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

J Zhuang, L Lu, M Dai, R Hu, J Chen, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance their perceptual capabilities by
integrating visual and textual information. However, processing the massive number of …

Recoverable compression: A multimodal vision token recovery mechanism guided by text information

Y Chen, J Xu, XY Zhang, WZ Liu, YY Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the advancement of large-scale language modeling techniques, large multimodal
models combining visual encoders with large language models have demonstrated …

Balancing performance and efficiency: A multimodal large language model pruning method based image text interaction

G Yu, Y Chen, J Xu - arXiv preprint arXiv:2409.01162, 2024 - arxiv.org
Recently, multimodal large language models (MM-LLMs) have achieved great success in
many multimodal tasks, but their high computational costs limit their further promotion and …

NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality

C Tao, G Kwon, V Gunjal, H Yang, Z Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
We study the capability of Video-Language (VidL) models in understanding compositions
between objects, attributes, actions and their relations. Composition understanding …

D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models

Z Wan, X Wu, Y Zhang, Y Xin, C Tao, Z Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory
demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache …