InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

FineCLIPER: Multi-modal fine-grained CLIP for dynamic facial expression recognition with adapters

H Chen, H Huang, J Dong, M Zheng… - Proceedings of the 32nd …, 2024 - dl.acm.org
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human
behavior. However, current methods exhibit limited performance mainly due to the …

MovieDreamer: Hierarchical generation for coherent long visual sequence

C Zhao, M Liu, W Wang, W Chen, F Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in video generation have primarily leveraged diffusion models for
short-duration content. However, these approaches often fall short in modeling complex …

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …

E5-V: Universal embeddings with multimodal large language models

T Jiang, M Song, Z Zhang, H Huang, W Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …

PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

L Xing, Q Huang, X Dong, J Lu, P Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of
information. As the idiom "A picture is worth a thousand words" implies, representing a …

VAR-CLIP: Text-to-image generator with visual auto-regressive modeling

Q Zhang, X Dai, N Yang, X An, Z Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to
'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

Cross-attention makes inference cumbersome in text-to-image diffusion models

W Zhang, H Liu, J Xie, F Faccio, MZ Shou… - arXiv preprint arXiv …, 2024 - arxiv.org
This study explores the role of cross-attention during inference in text-conditional diffusion
models. We find that cross-attention outputs converge to a fixed point after a few inference …