A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine …

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

P Marcos-Manchón, R Alcover-Couso… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating
high-quality images from text prompts, models such as Stable Diffusion have been …

LIME: Localized image editing via attention regularization in diffusion models

E Simsar, A Tonioni, Y Xian, T Hofmann… - arXiv preprint arXiv …, 2023 - arxiv.org
Diffusion models (DMs) have gained prominence due to their ability to generate high-quality,
varied images, with recent advancements in text-to-image generation. The research focus is …

Stable Diffusion exposed: Gender bias from prompt to image

Y Wu, Y Nakashima, N Garcia - arXiv preprint arXiv:2312.03027, 2023 - arxiv.org
Recent studies have highlighted biases in generative models, shedding light on their
predisposition towards gender-based stereotypes and imbalances. This paper contributes to …

LocInv: Localization-aware Inversion for Text-Guided Image Editing

C Tang, K Wang, F Yang, J van de Weijer - arXiv preprint arXiv …, 2024 - arxiv.org
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation
capabilities from textual prompts. Building on T2I diffusion models, text-guided image …

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

D Yang, R Dong, J Ji, Y Ma, H Wang, X Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, diffusion models have increasingly demonstrated their capabilities in vision
understanding. By leveraging prompt-based learning to construct sentences, these models …

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Z Zhu, X Feng, D Chen, J Yuan, C Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we explore the visual representations produced by a pre-trained text-to-
video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent …

Do text-free diffusion models learn discriminative visual representations?

S Mukhopadhyay, M Gwilliam, Y Yamaguchi… - arXiv preprint arXiv …, 2023 - arxiv.org
While many unsupervised learning models focus on one family of tasks, either generative or
discriminative, we explore the possibility of a unified representation learner: a model which …

InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

N Saini, N Bodla, A Shrivastava… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InVi, an approach for inserting or replacing objects within videos (referred to
as inpainting) using off-the-shelf, text-to-image latent diffusion models. InVi targets controlled …

GazeHTA: End-to-end Gaze Target Detection with Head-Target Association

ZY Lin, JY Chew, J van Gemert, X Zhang - arXiv preprint arXiv:2404.10718, 2024 - arxiv.org
We propose an end-to-end approach for gaze target detection: predicting a head-target
connection between individuals and the target image regions they are looking at. Most of the …