In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era …
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model …
Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing good generalization to real-world settings. However, particularly in …
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely …
Z Lin, S Yu, Z Kuang, D Pathak… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …
Abstract In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models …
We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason …
H Ha, S Song - arXiv preprint arXiv:2207.11514, 2022 - arxiv.org
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual …
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily …