Unveiling Encoder-Free Vision-Language Models

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Cited by 57 Related articles All 3 versions

[PDF] arxiv.org

Sherl: Synthesizing high accuracy and efficient memory for resource-limited transfer learning

H Diao, B Wan, X Jia, Y Zhuge, Y Zhang, H Lu… - … on Computer Vision, 2025 - Springer

Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for
adapting large pre-trained models to downstream tasks, greatly reducing trainable …

Cited by 1 Related articles All 8 versions

[PDF] arxiv.org

Frontiers in intelligent colonoscopy

GP Ji, J Liu, P Xu, N Barnes, FS Khan, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org

Colonoscopy is currently one of the most sensitive screening methods for colorectal cancer.
This study investigates the frontiers of intelligent colonoscopy techniques and their …

Cited by 3 Related articles All 2 versions

[PDF] arxiv.org

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Related articles All 2 versions

[PDF] arxiv.org

Interleaved-modal chain-of-thought

J Gao, Y Li, Z Cao, W Li - arXiv preprint arXiv:2411.19488, 2024 - arxiv.org

Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series
of intermediate reasoning steps before arriving at the final answer. However, when …

Cited by 2 Related articles All 2 versions

[PDF] arxiv.org

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

H Li, C Tian, J Shao, X Zhu, Z Wang, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org

The remarkable success of Large Language Models (LLMs) has extended to the multimodal
domain, achieving outstanding performance in image understanding and generation …

Related articles All 2 versions

[PDF] arxiv.org

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

C Tao, S Su, X Zhu, C Zhang, Z Chen, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid advance of Large Language Models (LLMs) has catalyzed the development of
Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders …

Related articles All 2 versions

[PDF] arxiv.org

FastVLM: Efficient Vision Encoding for Vision Language Models

PKA Vasu, F Faghri, CL Li, C Koc, N True… - arXiv preprint arXiv …, 2024 - arxiv.org

Scaling the input image resolution is essential for enhancing the performance of Vision
Language Models (VLMs), particularly in text-rich image understanding tasks. However …

[PDF] arxiv.org

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

J Jain, Z Yang, H Shi, J Gao, J Yang - arXiv preprint arXiv:2412.09585, 2024 - arxiv.org

The standard practice for developing contemporary MLLMs is to feed features from vision
encoder(s) into the LLM and train with natural language supervision. In this work, we posit …

Related articles All 2 versions

[PDF] arxiv.org

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

J Yi, ST Wasim, Y Luo, M Naseer, J Gall - arXiv preprint arXiv:2412.18609, 2024 - arxiv.org

We present an efficient encoder-free approach for video-language understanding that
achieves competitive performance while significantly reducing computational overhead …

Related articles All 2 versions
