Pix2Struct: Screenshot parsing as pretraining for visual language understanding

K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced the performance for many vision-
language tasks. However, most existing pre-trained models only excel in either …

LION: Empowering multimodal large language model with dual-level visual knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …

HRVDA: High-resolution visual document assistant

C Liu, K Yin, H Cao, X Jiang, X Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Leveraging vast training data, multimodal large language models (MLLMs) have
demonstrated formidable general visual comprehension capabilities and achieved …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) have rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Bootstrapping vision-language learning with decoupled language pre-training

Y Jian, C Gao, S Vosoughi - Advances in Neural …, 2024 - proceedings.neurips.cc
We present a novel methodology aimed at optimizing the application of frozen large
language models (LLMs) for resource-intensive vision-language (VL) pre-training. The …

ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

J Lu, D Batra, D Parikh, S Lee - Advances in neural …, 2019 - proceedings.neurips.cc
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-
agnostic joint representations of image content and natural language. We extend the …

Unifying vision-and-language tasks via text generation

J Cho, J Lei, H Tan, M Bansal - International Conference on …, 2021 - proceedings.mlr.press
Existing methods for vision-and-language learning typically require designing task-specific
architectures and objectives for each task. For example, a multi-label answer classifier for …

Unified language-vision pretraining with dynamic discrete visual tokenization

Y Jin, K Xu, L Chen, C Liao, J Tan, B Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, the remarkable advance of large language models (LLMs) has inspired
researchers to transfer their extraordinary reasoning capability to data across several …