Scaling open-vocabulary image segmentation with image-level labels

G Ghiasi, X Gu, Y Cui, TY Lin - European Conference on Computer Vision, 2022 - Springer
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Text2Mesh: Text-driven neural stylization for meshes

O Michel, R Bar-On, R Liu, S Benaim… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this work, we develop intuitive controls for editing the style of 3D objects. Our framework,
Text2Mesh, stylizes a 3D mesh by predicting color and local geometric details which …

Stable bias: Evaluating societal representations in diffusion models

S Luccioni, C Akiki, M Mitchell… - Advances in Neural …, 2024 - proceedings.neurips.cc
As machine learning-enabled Text-to-Image (TTI) systems are becoming increasingly
prevalent and seeing growing adoption as commercial services, characterizing the social …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …

Stable bias: Analyzing societal representations in diffusion models

AS Luccioni, C Akiki, M Mitchell, Y Jernite - arXiv preprint arXiv …, 2023 - arxiv.org
As machine learning-enabled Text-to-Image (TTI) systems are becoming increasingly
prevalent and seeing growing adoption as commercial services, characterizing the social …

Effective conditioned and composed image retrieval combining CLIP-based features

A Baldrati, M Bertini, T Uricchio… - Proceedings of the …, 2022 - openaccess.thecvf.com
Conditioned and composed image retrieval extend CBIR systems by combining a query
image with an additional text that expresses the intent of the user, describing additional …

A systematic survey of prompt engineering on vision-language foundation models

J Gu, Z Han, S Chen, A Beirami, B He, G Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Prompt engineering is a technique that involves augmenting a large pre-trained model with
task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be …

ITI-GEN: Inclusive text-to-image generation

C Zhang, X Chen, S Chai, CH Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-to-image generative models often reflect the biases of the training data, leading to
unequal representations of underrepresented groups. This study investigates inclusive text …