This monograph surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches …
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks …
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce …
A Ramesh, P Dhariwal, A Nichol, C Chu… - arXiv preprint arXiv …, 2022 - 3dvar.com
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image …
JY Koh, D Fried… - Advances in Neural …, 2024 - proceedings.neurips.cc
We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces …
K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …
CS Xia, Y Wei, L Zhang - 2023 IEEE/ACM 45th International …, 2023 - ieeexplore.ieee.org
Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face …