Unified coarse-to-fine alignment for video-text retrieval

Correlation-guided query-dependency calibration in video representation learning for temporal grounding

WJ Moon, S Hyun, SB Lee, JP Heo - CoRR, 2023 - openreview.net

Temporal Grounding is to identify specific moments or highlights from a video corresponding
to textual descriptions. Typical approaches in temporal grounding treat all video clips …

Cited by 33 · Related articles · All 2 versions

[PDF] arxiv.org

Uncertainty-aware sign language video retrieval with probability distribution modeling

X Wu, H Li, Y Luo, X Cheng, X Zhuang, M Cao… - European Conference on …, 2024 - Springer

Sign language video retrieval plays a key role in facilitating information access for the deaf
community. Despite significant advances in video-text retrieval, the complexity and inherent …

Cited by 9 · Related articles · All 2 versions

[PDF] arxiv.org

A simple LLM framework for long-range video question-answering

C Zhang, T Lu, MM Islam, Z Wang, S Yu… - arXiv preprint arXiv …, 2023 - arxiv.org

We present LLoVi, a language-based framework for long-range video question-answering
(LVQA). Unlike prior long-range video understanding methods, which are often costly and …

Cited by 60 · Related articles · All 2 versions

[PDF] arxiv.org

TempMe: Video temporal token merging for efficient text-video retrieval

L Shen, T Hao, S Zhao, Y Zhang, P Liu, Y Bao… - arXiv preprint arXiv …, 2024 - arxiv.org

Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone,
incorporating complex modules that result in high computational overhead. As a result …

Cited by 5 · Related articles · All 2 versions

[PDF] arxiv.org

MUSE: Mamba is efficient multi-scale learner for text-video retrieval

H Tang, M Cao, J Huang, R Liu, P Jin, G Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Text-Video Retrieval (TVR) aims to align and associate relevant video content with
corresponding natural language queries. Most existing TVR methods are based on large …

Cited by 6 · Related articles · All 4 versions

[PDF] pkwyx.com

KDProR: A knowledge-decoupling probabilistic framework for video-text retrieval

X Zhuang, H Li, X Cheng, Z Zhu, Y Xie… - European Conference on …, 2024 - Springer

Existing video-text retrieval methods predominantly focus on designing diverse cross-modal
interaction mechanisms between captions and videos. However, those approaches diverge …

Cited by 2 · Related articles · All 5 versions

[PDF] arxiv.org

SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

L Jiang, M Wang, Z Li, Y Fang, W Zhou… - Proceedings of the 32nd …, 2024 - dl.acm.org

Different from traditional video retrieval, sign language retrieval is more biased towards
understanding the semantic information of human actions contained in video clips. Previous …

Cited by 3 · Related articles · All 5 versions

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

T Hannan, MM Islam, T Seidl, G Bertasius - European Conference on …, 2024 - Springer

Locating specific moments within long videos (20–120 min) presents a significant challenge,
akin to finding a needle in a haystack. Adapting existing short video (5–30 s) grounding …

Cited by 1 · Related articles · All 2 versions

[PDF] pkusz.edu.cn

[PDF] GPA: global and prototype alignment for audio-text retrieval

Y Xie, Z Zhu, X Zhuang, L Liang, Z Wang… - Proc. Interspeech …, 2024 - pkusz.edu.cn

Recent Audio-Text Retrieval (ATR) models have achieved progressive results,
which pursue semantic interaction upon audio and text pairs. To clarify this coarse-grained …

Cited by 5 · Related articles · All 3 versions

Vlap: Efficient video-language alignment via frame prompting and distilling for video question answering

X Wang, J Liang, CK Wang, K Deng, Y Lou, MC Lin… - CoRR, 2023 - openreview.net

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA
model addresses both efficient frame sampling and effective cross-modal alignment in a …

Cited by 5 · Related articles · All 2 versions
