Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Sherl: Synthesizing high accuracy and efficient memory for resource-limited transfer learning

H Diao, B Wan, X Jia, Y Zhuge, Y Zhang, H Lu… - … on Computer Vision, 2025 - Springer
Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for
adapting large pre-trained models to downstream tasks, greatly reducing trainable …

Frontiers in intelligent colonoscopy

GP Ji, J Liu, P Xu, N Barnes, FS Khan, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Colonoscopy is currently one of the most sensitive screening methods for colorectal cancer.
This study investigates the frontiers of intelligent colonoscopy techniques and their …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Interleaved-modal chain-of-thought

J Gao, Y Li, Z Cao, W Li - arXiv preprint arXiv:2411.19488, 2024 - arxiv.org
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series
of intermediate reasoning steps before arriving at the final answer. However, when …

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

H Li, C Tian, J Shao, X Zhu, Z Wang, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable success of Large Language Models (LLMs) has extended to the multimodal
domain, achieving outstanding performance in image understanding and generation …

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

C Tao, S Su, X Zhu, C Zhang, Z Chen, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advance of Large Language Models (LLMs) has catalyzed the development of
Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders …

FastVLM: Efficient Vision Encoding for Vision Language Models

PKA Vasu, F Faghri, CL Li, C Koc, N True… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling the input image resolution is essential for enhancing the performance of Vision
Language Models (VLMs), particularly in text-rich image understanding tasks. However …

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

J Jain, Z Yang, H Shi, J Gao, J Yang - arXiv preprint arXiv:2412.09585, 2024 - arxiv.org
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit …

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

J Yi, ST Wasim, Y Luo, M Naseer, J Gall - arXiv preprint arXiv:2412.18609, 2024 - arxiv.org
We present an efficient encoder-free approach for video-language understanding that
achieves competitive performance while significantly reducing computational overhead …