Recognize anything: A strong image tagging model

Y Zhang, X Huang, J Ma, Z Li, Z Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM makes a substantial step for foundation models in computer vision …

Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model

S Smith, M Patwary, B Norick, P LeGresley… - arXiv preprint arXiv …, 2022 - arxiv.org
Pretrained general-purpose language models can achieve state-of-the-art accuracies in
various natural language processing domains by adapting to downstream tasks via zero …

Comprehending and ordering semantics for image captioning

Y Li, Y Pan, T Yao, T Mei - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Comprehending the rich semantics in an image and ordering them in linguistic order are
essential to compose a visually-grounded and linguistically coherent description for image …

DualCoOp: Fast adaptation to multi-label recognition with limited annotations

X Sun, P Hu, K Saenko - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Solving multi-label recognition (MLR) for images in the low-label regime is a challenging
task with many real-world applications. Recent work learns an alignment between textual …

Estimating noise transition matrix with label correlations for noisy multi-label learning

S Li, X Xia, H Zhang, Y Zhan… - Advances in Neural …, 2022 - proceedings.neurips.cc
In label-noise learning, the noise transition matrix, bridging the class posterior for noisy and
clean data, has been widely exploited to learn statistically consistent classifiers. The …

Which tokens to use? Investigating token reduction in vision transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …

Learning multi-dimensional edge feature-based AU relation graph for facial action unit recognition

C Luo, S Song, W Xie, L Shen, H Gunes - arXiv preprint arXiv:2205.01782, 2022 - arxiv.org
The activations of Facial Action Units (AUs) mutually influence one another. While the
relationship between a pair of AUs can be complex and unique, existing approaches fail to …

Large loss matters in weakly supervised multi-label classification

Y Kim, JM Kim, Z Akata, J Lee - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Weakly supervised multi-label classification (WSML) task, which is to learn a multi-label
classification using partially observed labels per image, is becoming increasingly important …

Texts as images in prompt tuning for multi-label image recognition

Z Guo, B Dong, Z Ji, J Bai, Y Guo… - Proceedings of the …, 2023 - openaccess.thecvf.com
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained
models (e.g. CLIP) to various downstream tasks in data-limited or label-limited …

Natural language-assisted sign language recognition

R Zuo, F Wei, B Mak - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Sign languages are visual languages which convey information by signers' handshape,
facial expression, body movement, and so forth. Due to the inherent restriction of …