Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as bidirectional encoder representations from transformers (BERT) and the vision transformer (ViT) …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

After question-answering ChatGPT: Opportunities and challenges of super-large pre-trained models

J Lu, C Guo, X Dai, Q Miao, X Wang, J Yang, FY Wang - Acta Automatica Sinica, 2023 - aas.net.cn
Super-large pre-trained models (PTM) are a research direction that has risen rapidly in artificial
intelligence in recent years, and on a variety of tasks such as natural language processing (NLP) and computer vision they have achieved historically …

OmniVec: Learning robust representations with cross modal sharing

S Srivastava, G Sharma - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
The majority of research in learning-based methods has focused on designing and training
networks for specific tasks. However, many of the learning-based tasks, across modalities …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Detecting multimedia generated by large AI models: A survey

L Lin, N Gupta, Y Zhang, H Ren, CH Liu, F Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large
language models, has marked a new era where AI-generated multimedia is increasingly …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

LLMDet: A third party large language models generated text detection tool

K Wu, L Pang, H Shen, X Cheng, TS Chua - arXiv preprint arXiv …, 2023 - arxiv.org
Generated texts from large language models (LLMs) are remarkably close to high-quality
human-authored text, raising concerns about their potential misuse in spreading false …