Caltech-UCSD birds 200

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, ie describing images …

被引用次数：327 相关文章所有 11 个版本

[PDF] acm.org

Membership inference attacks on machine learning: A survey

H Hu, Z Salcic, L Sun, G Dobbie, PS Yu… - ACM Computing Surveys …, 2022 - dl.acm.org

Machine learning (ML) models have been widely applied to various applications, including
image classification, text generation, audio recognition, and graph data analysis. However …

被引用次数：424 相关文章所有 6 个版本

[PDF] arxiv.org

Dinov2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org

The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

被引用次数：1260 相关文章所有 11 个版本

[PDF] thecvf.com

Scaling up gans for text-to-image synthesis

M Kang, JY Zhu, R Zhang, J Park… - Proceedings of the …, 2023 - openaccess.thecvf.com

The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …

被引用次数：349 相关文章所有 6 个版本

[PDF] openreview.net

Unified-io: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

被引用次数：336 相关文章所有 3 个版本

[PDF] springer.com

Visual attention network

MH Guo, CZ Lu, ZN Liu, MM Cheng, SM Hu - Computational Visual Media, 2023 - Springer

While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …

被引用次数：620 相关文章所有 8 个版本

[PDF] neurips.cc

Deep model reassembly

X Yang, D Zhou, S Liu, J Ye… - Advances in neural …, 2022 - proceedings.neurips.cc

In this paper, we explore a novel knowledge-transfer task, termed as Deep Model
Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous …

被引用次数：132 相关文章所有 6 个版本

[PDF] neurips.cc

On-device training under 256kb memory

J Lin, L Zhu, WM Chen, WC Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc

On-device training enables the model to adapt to new data collected from the sensors by
fine-tuning a pre-trained model. Users can benefit from customized AI models without having …

被引用次数：176 相关文章所有 8 个版本

[PDF] neurips.cc

Magicbrush: A manually annotated dataset for instruction-guided image editing

K Zhang, L Mo, W Chen, H Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc

Text-guided image editing is widely needed in daily life, ranging from personal use to
professional applications such as Photoshop. However, existing methods are either zero …

被引用次数：94 相关文章所有 6 个版本

[PDF] thecvf.com

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …

被引用次数：52 相关文章所有 3 个版本

高级搜索

QQ 群

From show to tell: A survey on deep learning-based image captioning

Membership inference attacks on machine learning: A survey

Dinov2: Learning robust visual features without supervision

Scaling up gans for text-to-image synthesis

Unified-io: A unified model for vision, language, and multi-modal tasks

Visual attention network

Deep model reassembly

On-device training under 256kb memory

Magicbrush: A manually annotated dataset for instruction-guided image editing

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

引用