Deep generative models are becoming increasingly powerful, now generating diverse, high-fidelity, photo-realistic samples given text prompts. Have they reached the point where …
Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, raw web data is noisy, and existing filtering methods to …
X Du, Y Sun, J Zhu, Y Li - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Utilizing auxiliary outlier datasets to regularize the machine learning model has demonstrated promise for out-of-distribution (OOD) detection and safe prediction. Due to the …
Benefiting from prompt tuning, pre-trained vision-language models, e.g., CLIP, have shown promising performance on versatile downstream tasks in recent years. In this paper …
Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use synthetic data to train next-generation models. Repeating this …
Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, they are extremely hungry for delicate …
Recent advancements in Multimodal Large Language Models (MLLMs) have utilized Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …
The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) …
In this work, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), which refers to training non-specific classification architectures (downstream models) …