Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient …
H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this paper we present the first systematic study to investigate the design …
In this work, we present SEEM, a promotable and interactive model for segmenting everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …
Although perception systems have made remarkable advancements in recent years they still rely on explicit human instruction or pre-defined categories to identify the target objects …
Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing …
Humans can easily solve multimodal tasks in context with only a few demonstrations or simple instructions which current multimodal systems largely struggle to imitate. In this work …
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and …
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of …
All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field …