Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

ChatGPT-like large-scale foundation models for prognostics and health management: A survey and roadmaps

YF Li, H Wang, M Sun - Reliability Engineering & System Safety, 2024 - Elsevier
PHM technology is vital in industrial production and maintenance, identifying and predicting
potential equipment failures and damages. This enables proactive maintenance measures …

TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

S Chen, X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2411.18211, 2024 - arxiv.org
Rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (LMMs), particularly in vision-language tasks. However, existing …

Large scale foundation models for intelligent manufacturing applications: a survey

H Zhang, SD Semujju, Z Wang, X Lv, K Xu… - Journal of Intelligent …, 2025 - Springer
Although the applications of artificial intelligence, especially deep learning, have greatly
improved various aspects of intelligent manufacturing, they still face challenges for broader …

An Efficient Product-Customization Framework Based on Multimodal Data under the Social Manufacturing Paradigm

Y Li, H Wu, TS Tamir, Z Shen, S Liu, B Hu, G Xiong - Machines, 2023 - mdpi.com
With improvements in social productivity and technology, along with the popularity of the
Internet, consumer demands are becoming increasingly personalized and diversified …

CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios

X Qiao, X Li, X Qu, J Zhang, Y Liu, Y Luo, C Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models pre-trained on large-scale image-text datasets have shown
superior performance in downstream tasks such as image retrieval. Most of the images for …

BagFormer: Better cross-modal retrieval via bag-wise interaction

H Hou, X Yan, Y Zhang - Engineering Applications of Artificial Intelligence, 2024 - Elsevier
In the field of cross-modal retrieval, single encoder models tend to perform better than dual
encoder models, but they suffer from high latency and low throughput. In this paper, we …

Enhanced image-text retrieval based on CLIP with YOLOv10 and Next-ViT

X Qian, B Liu - … on Computer Vision, Application, and Algorithm …, 2025 - spiedigitallibrary.org
In recent years, the CLIP model has achieved remarkable success in image-text retrieval
tasks through contrastive learning. However, CLIP still exhibits certain limitations when …

Chinese image description evaluation method based on target domain semantic constraints

Z Wang, W Sun, Z Wang, L Yang - … Conference on Image …, 2023 - spiedigitallibrary.org
To address the problems of insufficient accuracy and difficulty of application in the current
Chinese image description field, this paper proposes an evaluation method based on …