Visual attention network

MH Guo, CZ Lu, ZN Liu, MM Cheng, SM Hu - Computational Visual Media, 2023 - Springer
While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …

Blended latent diffusion

O Avrahami, O Fried, D Lischinski - ACM Transactions on Graphics (TOG), 2023 - dl.acm.org
The tremendous progress in neural image generation, coupled with the emergence of
seemingly omnipotent vision-language models, has finally enabled text-based interfaces for …

Video pretraining (VPT): Learning to act by watching unlabeled online videos

B Baker, I Akkaya, P Zhokov… - Advances in …, 2022 - proceedings.neurips.cc
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for
training models with broad, general capabilities for text, images, and other modalities …

Mitigating neural network overconfidence with logit normalization

H Wei, R Xie, H Cheng, L Feng… - … conference on machine …, 2022 - proceedings.mlr.press
Detecting out-of-distribution inputs is critical for the safe deployment of machine learning
models in the real world. However, neural networks are known to suffer from the …

FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting

T Zhou, Z Ma, Q Wen, X Wang… - … on machine learning, 2022 - proceedings.mlr.press
Long-term time series forecasting is challenging since prediction accuracy tends to
decrease dramatically as the horizon increases. Although Transformer-based methods …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced performance on many vision-
language tasks. However, most existing pre-trained models only excel in either …

A ConvNet for the 2020s

Z Liu, H Mao, CY Wu, C Feichtenhofer… - Proceedings of the …, 2022 - openaccess.thecvf.com
The" Roaring 20s" of visual recognition began with the introduction of Vision Transformers
(ViTs), which quickly superseded ConvNets as the state-of-the-art image classification …

CogView2: Faster and better text-to-image generation via hierarchical transformers

M Ding, W Zheng, W Hong… - Advances in Neural …, 2022 - proceedings.neurips.cc
The development of transformer-based text-to-image models is impeded by their slow
generation and complexity for high-resolution images. In this work, we put forward a …

Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction

Z Yang, X Gao, W Zhou, S Jiao… - Proceedings of the …, 2024 - openaccess.thecvf.com
Implicit neural representation has paved the way for new approaches to dynamic scene
reconstruction. Nonetheless, cutting-edge dynamic neural rendering methods rely heavily on …

Mass-editing memory in a transformer

K Meng, AS Sharma, A Andonian, Y Belinkov… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent work has shown exciting promise in updating large language models with new
memories, so as to replace obsolete information or add specialized knowledge. However …