Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as bidirectional encoder representations from transformers (BERT) and the vision transformer (ViT) …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

After question-answering ChatGPT: Opportunities and challenges of super-large pre-trained models

J Lu, C Guo, X Dai, Q Miao, X Wang, J Yang, FY Wang - Acta Automatica Sinica, 2023 - aas.net.cn
Super-large pre-trained models (PTM) are a research direction that has risen rapidly in artificial
intelligence in recent years, and on a variety of tasks such as natural language processing (NLP) and computer vision they have achieved historically …

OmniVec: Learning robust representations with cross modal sharing

S Srivastava, G Sharma - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
The majority of research in learning-based methods has focused on designing and training
networks for specific tasks. However, many of the learning-based tasks, across modalities …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Detecting multimedia generated by large AI models: A survey

L Lin, N Gupta, Y Zhang, H Ren, CH Liu, F Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large
language models, has marked a new era where AI-generated multimedia is increasingly …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

LLMDet: A third party large language models generated text detection tool

K Wu, L Pang, H Shen, X Cheng, TS Chua - arXiv preprint arXiv …, 2023 - arxiv.org
Generated texts from large language models (LLMs) are remarkably close to high-quality
human-authored text, raising concerns about their potential misuse in spreading false …