UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

[PDF] UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Z Gan… - njuhugn.github.io
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

[PDF] UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Z Gan… - ecva.net
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

[CITATION][C] UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - … Vision–ECCV 2020, 2020 - cir.nii.ac.jp

UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - arXiv e …, 2019 - ui.adsabs.harvard.edu
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Z Gan… - arXiv preprint arXiv …, 2019 - arxiv.org
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

UNITER: UNiversal Image-TExt Representation Learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - … on Computer Vision, 2020 - dl.acm.org
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …