Visual attention network

MH Guo, CZ Lu, ZN Liu, MM Cheng, SM Hu - Computational Visual Media, 2023 - Springer
While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …

Blended latent diffusion

O Avrahami, O Fried, D Lischinski - ACM Transactions on Graphics (TOG), 2023 - dl.acm.org
The tremendous progress in neural image generation, coupled with the emergence of
seemingly omnipotent vision-language models, has finally enabled text-based interfaces for …

Video pretraining (VPT): Learning to act by watching unlabeled online videos

B Baker, I Akkaya, P Zhokov… - Advances in …, 2022 - proceedings.neurips.cc
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for
training models with broad, general capabilities for text, images, and other modalities …

Mitigating neural network overconfidence with logit normalization

H Wei, R Xie, H Cheng, L Feng… - … conference on machine …, 2022 - proceedings.mlr.press
Detecting out-of-distribution inputs is critical for the safe deployment of machine learning
models in the real world. However, neural networks are known to suffer from the …

FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting

T Zhou, Z Ma, Q Wen, X Wang… - … on machine learning, 2022 - proceedings.mlr.press
Long-term time series forecasting is challenging since prediction accuracy tends to
decrease dramatically as the horizon increases. Although Transformer-based methods …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced performance on many vision-
language tasks. However, most existing pre-trained models only excel in either …

A ConvNet for the 2020s

Z Liu, H Mao, CY Wu, C Feichtenhofer… - Proceedings of the …, 2022 - openaccess.thecvf.com
The" Roaring 20s" of visual recognition began with the introduction of Vision Transformers
(ViTs), which quickly superseded ConvNets as the state-of-the-art image classification …

CogView2: Faster and better text-to-image generation via hierarchical transformers

M Ding, W Zheng, W Hong… - Advances in Neural …, 2022 - proceedings.neurips.cc
The development of transformer-based text-to-image models is impeded by their slow
generation and complexity for high-resolution images. In this work, we put forward a …

Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction

Z Yang, X Gao, W Zhou, S Jiao… - Proceedings of the …, 2024 - openaccess.thecvf.com
Implicit neural representation has paved the way for new approaches to dynamic scene
reconstruction. Nonetheless, cutting-edge dynamic neural rendering methods rely heavily on …

Mass-editing memory in a transformer

K Meng, AS Sharma, A Andonian, Y Belinkov… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent work has shown exciting promise in updating large language models with new
memories, so as to replace obsolete information or add specialized knowledge. However …