Pre-trained models for natural language processing: A survey

X Qiu, T Sun, Y Xu, Y Shao, N Dai, X Huang - Science China …, 2020 - Springer
Recently, the emergence of pre-trained models (PTMs) has brought natural language
processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs …

Recent advances in deep learning based dialogue systems: A systematic survey

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer
Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

[PDF][PDF] The dawn of lmms: Preliminary explorations with gpt-4v (ision)

Z Yang, L Li, K Lin, J Wang, CC Lin… - arXiv preprint arXiv …, 2023 - stableaiprompts.com
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory
skills, such as visual understanding, to achieve stronger generic intelligence. In this paper …

Side adapter network for open-vocabulary semantic segmentation

M Xu, Z Zhang, F Wei, H Hu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper presents a new framework for open-vocabulary semantic segmentation with the
pre-trained vision-language model, named SAN. Our approach models the semantic …

Mm-vet: Evaluating large multimodal models for integrated capabilities

W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …

Glipv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks
(eg, object detection, instance segmentation) and Vision-Language (VL) understanding …

Groupvit: Semantic segmentation emerges from text supervision

J Xu, S De Mello, S Liu, W Byeon… - Proceedings of the …, 2022 - openaccess.thecvf.com
Grouping and recognition are important components of visual scene understanding, eg, for
object detection and semantic segmentation. With end-to-end deep learning systems …

Grounded language-image pre-training

LH Li, P Zhang, H Zhang, J Yang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …

Clip-adapter: Better vision-language models with feature adapters

P Gao, S Geng, R Zhang, T Ma, R Fang… - International Journal of …, 2024 - Springer
Large-scale contrastive vision-language pretraining has shown significant progress in visual
representation learning. Unlike traditional visual systems trained by a fixed set of discrete …

Mdetr-modulated detection for end-to-end multi-modal understanding

A Kamath, M Singh, Y LeCun… - Proceedings of the …, 2021 - openaccess.thecvf.com
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of
interest from the image. However, this crucial module is typically used as a black box …