Related articles - Academic Resource Search

Can I trust your answer? Visually grounded video question answering

J Xiao, A Yao, Y Li, TS Chua - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com

We study visually grounded VideoQA in response to the emerging trends of utilizing
pretraining techniques for video-language understanding. Specifically by forcing vision …

Cited by 15 · Related articles · All 3 versions

[PDF] thecvf.com

OmniVid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com

The core of video understanding tasks, such as recognition, captioning, and tracking, is to
automatically detect objects or actions in a video and analyze their temporal evolution …

Cited by 4 · Related articles · All 3 versions

[PDF] thecvf.com

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Cited by 21 · Related articles · All 6 versions

[PDF] thecvf.com

Lecture presentations multimodal dataset: Towards understanding multimodality in educational videos

DW Lee, C Ahuja, PP Liang, S Natu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Many educational videos use slide presentations, a sequence of visual pages that contain
text and figures accompanied by spoken language, which are constructed and presented …

Cited by 5 · Related articles · All 3 versions

[PDF] thecvf.com

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com

Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …

Cited by 163 · Related articles · All 6 versions

[PDF] arxiv.org

DocumentCLIP: Linking figures and main body text in reflowed documents

F Liu, H Tan, C Tensmeyer - arXiv preprint arXiv:2306.06306, 2023 - arxiv.org

Vision-language pretraining models have achieved great success in supporting multimedia
applications by understanding the alignments between images and text. While existing …

Cited by 21 · Related articles · All 2 versions

[PDF] thecvf.com

Visual abductive reasoning

C Liang, W Wang, T Zhou… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Abductive reasoning seeks the likeliest possible explanation for partial observations.
Although abduction is frequently employed in human daily reasoning, it is rarely explored in …

Cited by 43 · Related articles · All 8 versions

[PDF] arxiv.org

VideoChat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv preprint arXiv …, 2023 - arxiv.org

In this paper, we initiate an attempt of developing an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Cited by 294 · Related articles · All 4 versions

[PDF] thecvf.com

ActBERT: Learning global-local video-text representations

L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com

In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …

Cited by 454 · Related articles · All 10 versions

[PDF] aaai.org

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrow ideas from cross-lingual pre-trained …

Cited by 884 · Related articles · All 12 versions
