Delving into out-of-distribution detection with vision-language representations

Y Ming, Z Cai, J Gu, Y Sun, W Li… - Advances in neural …, 2022 - proceedings.neurips.cc
Recognizing out-of-distribution (OOD) samples is critical for machine learning systems
deployed in the open world. The vast majority of OOD detection methods are driven by a …

ImageBART: Bidirectional context with multinomial diffusion for autoregressive image synthesis

P Esser, R Rombach, A Blattmann… - Advances in neural …, 2021 - proceedings.neurips.cc
Autoregressive models and their sequential factorization of the data likelihood have recently
demonstrated great potential for image representation and synthesis. Nevertheless, they …

TeCH: Text-guided reconstruction of lifelike clothed humans

Y Huang, H Yi, Y Xiu, T Liao, J Tang… - … Conference on 3D …, 2024 - ieeexplore.ieee.org
Despite recent research advancements in reconstructing clothed humans from a single
image, accurately restoring the “unseen regions” with high-level details remains an …

ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training

B Shan, W Yin, Y Sun, H Tian, H Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted
extensive attention from academia and industry due to their superior performance on various …

Diversity and bias in audio captioning datasets

I Martín-Morató, A Mesaros - 2021 - trepo.tuni.fi
Describing soundscapes in sentences allows better understanding of the acoustic scene
than a single label indicating the acoustic scene class or a set of audio tags indicating the …

Dual-modal transformer with enhanced inter- and intra-modality interactions for image captioning

D Kumar, V Srivastava, DE Popescu, JD Hemanth - Applied Sciences, 2022 - mdpi.com
Image captioning is oriented towards describing an image with the best possible choice of
words that can convey a semantic, relatable meaning of the depicted scene. Different …

JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models

Y Wada, K Kaneda, K Sugiura - arXiv preprint arXiv:2311.04192, 2023 - arxiv.org
Image captioning studies heavily rely on automatic evaluation metrics such as BLEU and
METEOR. However, such n-gram-based metrics have been shown to correlate poorly with …

Boosting generic visual-linguistic representation with dynamic contexts

G Ma, Y Bai, W Zhang, T Yao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Pretraining large models on large-scale multi-modal corpora has accelerated the
development of visual-linguistic (VL) representation and achieved great success on various …

COSMic: a coherence-aware generation metric for image descriptions

M Inan, P Sharma, B Khalid, R Soricut, M Stone… - arXiv preprint arXiv …, 2021 - arxiv.org
Developers of text generation models rely on automated evaluation metrics as a stand-in for
slow and expensive manual evaluations. However, image captioning metrics have struggled …

Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs

DL Fernandes, MHF Ribeiro, FR Cerqueira… - arXiv preprint arXiv …, 2022 - arxiv.org
Several services for people with visual disabilities have emerged recently, driven by
advances in the areas of Assistive Technology and Artificial Intelligence. Despite the growth …