NUWA-LIP: language-guided image inpainting with defect-free VQGAN

M Ni, X Li, W Zuo - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Abstract Language-guided image inpainting aims to fill the defective regions of an image
under the guidance of text while keeping the non-defective regions unchanged. However …

Utilizing greedy nature for multimodal conditional image synthesis in transformers

S Su, J Zhu, L Gao, J Song - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Multimodal Conditional Image Synthesis (MCIS) aims to generate images according to
inputs from different modalities and their combinations, which allows users to describe their …

Exploring efficient few-shot adaptation for vision transformers

C Xu, S Yang, Y Wang, Z Wang, Y Fu, X Xue - arXiv preprint arXiv …, 2023 - arxiv.org
The task of Few-shot Learning (FSL) aims to perform inference on novel categories containing
only a few labeled examples, with the help of knowledge learned from base categories …

Human motionformer: Transferring human motions with vision transformers

H Liu, X Han, C Jin, L Qian, H Wei, Z Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
Human motion transfer aims to transfer motions from a target dynamic person to a source
static one for motion synthesis. An accurate matching between the source person and the …

Asset: autoregressive semantic scene editing with transformers at high resolutions

D Liu, S Shetty, T Hinz, M Fisher, R Zhang… - ACM Transactions on …, 2022 - dl.acm.org
We present ASSET, a neural architecture for automatically modifying an input high-
resolution image according to a user's edits on its semantic segmentation map. Our …

Edibert, a generative model for image editing

T Issenhuth, U Tanielian, J Mary, D Picard - arXiv preprint arXiv …, 2021 - arxiv.org
Advances in computer vision are pushing the limits of image manipulation, with generative
models sampling detailed images on various tasks. However, a specialized model is often …

Elmformer: Efficient raw image restoration with a locally multiplicative transformer

J Ma, S Yan, L Zhang, G Wang, Q Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org
In order to obtain high-quality raw images for downstream Image Signal Processing (ISP), in this
paper we present an Efficient Locally Multiplicative Transformer called ELMformer for raw …

ViR: Vision Retention Networks

A Hatamizadeh, M Ranzinger, J Kautz - arXiv preprint arXiv:2310.19731, 2023 - arxiv.org
Vision Transformers (ViTs) have gained considerable popularity in recent years, due to their
exceptional capabilities in modeling long-range spatial dependencies and scalability for …

QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation

Y Hong, X Qian, S Luo, G Guo… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper studies the task of conditional Human Motion Animation (cHMA). Given a source
image and a driving video, the model should animate a new frame sequence, in which the …

Shaken, and Stirred: Long-Range Dependencies Enable Robust Outlier Detection with PixelCNN++

BM Umapathi, K Chauhan, P Shenoy… - arXiv preprint arXiv …, 2022 - arxiv.org
Reliable outlier detection is critical for real-world deployment of deep learning models.
Although extensively studied, likelihoods produced by deep generative models have been …