A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling long-range visual
dependencies. To this end, we explore video-first architectures building on the common …

UPop: Unified and progressive pruning for compressing vision-language transformers

D Shi, C Tao, Y Jin, Z Yang, C Yuan… - … on Machine Learning, 2023 - proceedings.mlr.press
Real-world data contains a vast amount of multimodal information, among which vision and
language are the two most representative modalities. Moreover, increasingly heavier …

TESTA: Temporal-spatial token aggregation for long-form video-language understanding

S Ren, S Chen, S Li, X Sun, L Hou - arXiv preprint arXiv:2310.19060, 2023 - arxiv.org
Large-scale video-language pre-training has made remarkable strides in advancing
video-language understanding tasks. However, the heavy computational burden of video …

MemBridge: Video-language pre-training with memory-augmented inter-modality bridge

J Yang, X Li, M Zheng, Z Wang, Y Zhu… - … on Image Processing, 2023 - ieeexplore.ieee.org
Video-language pre-training has attracted considerable attention recently for its promising
performance on various downstream tasks. Most existing methods utilize the modality …

Enhancing video-language representations with structural spatio-temporal alignment

H Fei, S Wu, M Zhang, M Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
While pre-training large-scale video-language models (VLMs) has shown remarkable
potential for various downstream video-language tasks, existing VLMs can still suffer from …

Balance act: Mitigating hubness in cross-modal retrieval with query and gallery banks

Y Wang, X Jian, B Xue - arXiv preprint arXiv:2310.11612, 2023 - arxiv.org
In this work, we present a post-processing solution to address the hubness problem in
cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently …

MCLF: A multi-grained contrastive learning framework for ASR-robust spoken language understanding

Z Huang, D Chen, Z Zhu, X Cheng - Findings of the Association for …, 2023 - aclanthology.org
Enhancing the robustness towards Automatic Speech Recognition (ASR) errors is of great
importance for Spoken Language Understanding (SLU). Trending ASR-robust SLU systems …

CrossGET: Cross-guided ensemble of tokens for accelerating vision-language transformers

D Shi, C Tao, A Rao, Z Yang, C Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent vision-language models have achieved tremendous progress far beyond what we
ever expected. However, their computational costs are also dramatically growing with rapid …

Few-shot Action Recognition with Captioning Foundation Models

X Wang, S Zhang, H Yuan, Y Zhang, C Gao… - arXiv preprint arXiv …, 2023 - arxiv.org
Transferring vision-language knowledge from pretrained multimodal foundation models to
various downstream tasks is a promising direction. However, most current few-shot action …

Bayes risk CTC: Controllable CTC alignment in sequence-to-sequence tasks

J Tian, B Yan, J Yu, C Weng, D Yu… - arXiv preprint arXiv …, 2022 - arxiv.org
Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence.
The Connectionist Temporal Classification (CTC) criterion is widely used in multiple …