The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content …
H Chen, H Huang, J Dong, M Zheng… - Proceedings of the 32nd …, 2024 - dl.acm.org
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the …
Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex …
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of …
T Jiang, M Song, Z Zhang, H Huang, W Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal …
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a …
Q Zhang, X Dai, N Yang, X An, Z Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers …
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …
This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference …