From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Membership inference attacks on machine learning: A survey

H Hu, Z Salcic, L Sun, G Dobbie, PS Yu… - ACM Computing Surveys …, 2022 - dl.acm.org
Machine learning (ML) models have been widely applied to various applications, including
image classification, text generation, audio recognition, and graph data analysis. However …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Scaling up GANs for text-to-image synthesis

M Kang, JY Zhu, R Zhang, J Park… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Visual attention network

MH Guo, CZ Lu, ZN Liu, MM Cheng, SM Hu - Computational Visual Media, 2023 - Springer
While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …

Deep model reassembly

X Yang, D Zhou, S Liu, J Ye… - Advances in neural …, 2022 - proceedings.neurips.cc
In this paper, we explore a novel knowledge-transfer task, termed as Deep Model
Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous …

On-device training under 256KB memory

J Lin, L Zhu, WM Chen, WC Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
On-device training enables the model to adapt to new data collected from the sensors by
fine-tuning a pre-trained model. Users can benefit from customized AI models without having …

MagicBrush: A manually annotated dataset for instruction-guided image editing

K Zhang, L Mo, W Chen, H Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc
Text-guided image editing is widely needed in daily life, ranging from personal use to
professional applications such as Photoshop. However, existing methods are either zero …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …