Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Y Qiao, H Duan, X Fang, J Yang, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide
array of visual questions, which requires strong perception and reasoning faculties …

Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

NC Mendonça - ACM Transactions on Computing Education, 2024 - dl.acm.org
The recent integration of visual capabilities into Large Language Models (LLMs) has the
potential to play a pivotal role in science and technology education, where visual elements …

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

X Cao, B Lai, W Ye, Y Ma, J Heintz, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Multimodal Large Language Models (MLLMs) have shown great promise in
language-guided perceptual tasks such as recognition, segmentation, and object detection …

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

X Zou, Y Chen - arXiv preprint arXiv:2407.02534, 2024 - arxiv.org
Large Visual Language Models (VLMs) such as GPT-4 have achieved remarkable success
in generating comprehensive and nuanced responses, surpassing the capabilities of large …