- Academic resource search

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 195 - Related articles - All 7 versions

[PDF] acm.org

A survey on causal inference

L Yao, Z Chu, S Li, Y Li, J Gao, A Zhang - ACM Transactions on …, 2021 - dl.acm.org

Causal inference is a critical research topic across many domains, such as statistics,
computer science, education, public policy, and economics, for decades. Nowadays …

Cited by 592 - Related articles - All 6 versions

[PDF] github.io

Graph neural networks: foundation, frontiers and applications

L Wu, P Cui, J Pei, L Zhao, X Guo - … of the 28th ACM SIGKDD Conference …, 2022 - dl.acm.org

The field of graph neural networks (GNNs) has seen rapid and incredible strides over the
recent years. Graph neural networks, also known as deep learning on graphs, graph …

Cited by 466 - Related articles - All 11 versions

[PDF] thecvf.com

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com

Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Cited by 320 - Related articles - All 14 versions

[PDF] arxiv.org

LXMERT: Learning cross-modality encoder representations from transformers

H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org

Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …

Cited by 2846 - Related articles - All 4 versions

[PDF] neurips.cc

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc

We present VILLA, the first known effort on large-scale adversarial training for
vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task …

Cited by 550 - Related articles - All 8 versions

[PDF] neurips.cc

Breaking the dilemma of medical image-to-image translation

L Kong, C Lian, D Huang, Y Hu… - Advances in Neural …, 2021 - proceedings.neurips.cc

Supervised Pix2Pix and unsupervised Cycle-consistency are two modes that
dominate the field of medical image-to-image translation. However, neither modes are ideal …

Cited by 149 - Related articles - All 6 versions

[PDF] thecvf.com

Counterfactual samples synthesizing for robust visual question answering

L Chen, X Yan, J Xiao, H Zhang… - Proceedings of the …, 2020 - openaccess.thecvf.com

Despite Visual Question Answering (VQA) has realized impressive progress over
the last few years, today's VQA models tend to capture superficial linguistic correlations in …

Cited by 383 - Related articles - All 8 versions

[PDF] arxiv.org

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org

Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Cited by 429 - Related articles - All 3 versions

[PDF] neurips.cc

COOT: Cooperative hierarchical transformer for video-text representation learning

S Ging, M Zolfaghari, H Pirsiavash… - Advances in neural …, 2020 - proceedings.neurips.cc

Many real-world video-text tasks involve different levels of granularity, such as frames and
words, clip and sentences or videos and paragraphs, each with distinct semantics. In this …

Cited by 205 - Related articles - All 12 versions
