M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

M Wang, J Xing, B Jiang, J Chen, J Mei, X Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

M Wang, J Xing, B Jiang, J Chen, J Mei, X Zuo… - Proceedings of the …, 2024 - ojs.aaai.org
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …

TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

J Lyu, J Wei, G Zeng, Z Li, E Xie, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing scene text spotters are designed to locate and transcribe texts from images.
However, it is challenging for a spotter to achieve precise detection and recognition of scene …

Recognizing Video Activities in the Wild via View-to-Scene Joint Learning

J Yu, Y Chen, X Wang, X Cheng… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Recognizing video actions in the wild is challenging for visual control systems. In-the-wild
videos show actions not seen in training data, recorded from various angles and scenes with …

Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Q Zhou, Y Hou, R Zhou, Y Li, JQ Wang, Z Wu… - Connection …, 2024 - Taylor & Francis
The canonical video action recognition methods usually label categories with numbers or
one-hot vectors and train neural networks to classify a fixed set of predefined categories …

Referring Atomic Video Action Recognition

K Peng, J Fu, K Yang, D Wen, Y Chen, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed
at identifying atomic actions of a particular person based on a textual description and the …

MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation

S Yun, R Masukawa, M Na, M Imani - arXiv preprint arXiv:2406.18815, 2024 - arxiv.org
In the context of escalating safety concerns across various domains, the tasks of Video
Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically …

Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

G Wu, Y Zhang, L Deng, J Zhang, T Chai - arXiv preprint arXiv:2406.09016, 2024 - arxiv.org
Fused Magnesium Furnace (FMF) is a crucial industrial equipment in the production of
magnesia, and anomaly detection plays a pivotal role in ensuring its efficient, stable, and …

Advancing Human Motion Recognition with SkeletonCLIP++: Weighted Video Feature Integration and Enhanced Contrastive Sample Discrimination

L Yuan, Z He, Q Wang, L Xu - Sensors, 2024 - mdpi.com
This paper introduces 'SkeletonCLIP++', an extension of our prior work in human action
recognition, emphasizing the use of semantic information beyond traditional label-based …

AdaViPro: Region-based Adaptive Visual Prompt for Large-Scale Models Adapting

M Yang, Y Tian, L Zhang, X Liang, X Ran… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, prompt-based methods have emerged as a new alternative 'parameter-efficient
fine-tuning' paradigm, which only fine-tunes a small number of additional parameters while …