Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

VisionLLaMA: A unified LLaMA backbone for vision tasks

X Chu, J Su, B Zhang, C Shen - European Conference on Computer …, 2024 - ecva.net
We all know that large language models are built on top of a transformer-based architecture
to process textual inputs. For example, the LLaMA family of models stands out among many …

Multimodal pathway: Improve transformers with irrelevant data from other modalities

Y Zhang, X Ding, K Gong, Y Ge… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose to improve transformers of a specific modality with irrelevant data from other
modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like …

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …

PixMIM: Rethinking pixel reconstruction in masked image modeling

Y Liu, S Zhang, J Chen, K Chen, D Lin - arXiv preprint arXiv:2303.02416, 2023 - arxiv.org
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked
Autoencoders (MAE) and BEiT. However, subsequent works have complicated the …

T3D: Towards 3D medical image understanding through vision-language pre-training

C Liu, C Ouyang, Y Chen, CC Quilodrán-Casas… - arXiv preprint arXiv …, 2023 - arxiv.org
Expert annotation of 3D medical images for downstream analysis is resource-intensive,
posing challenges in clinical applications. Visual self-supervised learning (vSSL), though …

VideoMAC: Video Masked Autoencoders Meet ConvNets

G Pei, T Chen, X Jiang, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently the advancement of self-supervised learning techniques like masked
autoencoders (MAE) has greatly influenced visual representation learning for images and …

Zero-shot ECG classification with multimodal learning and test-time clinical knowledge enhancement

C Liu, Z Wan, C Ouyang, A Shah, W Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac
arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) …

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

X Chu, J Su, B Zhang, C Shen - arXiv preprint arXiv:2403.00522, 2024 - arxiv.org
Large language models are built on top of a transformer-based architecture to process
textual inputs. For example, the LLaMA stands out among many open-source …

Learning to Rank Patches for Unbiased Image Redundancy Reduction

Y Luo, Z Chen, P Zhou, Z Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Images suffer from heavy spatial redundancy because pixels in neighboring regions are
spatially correlated. Existing approaches strive to overcome this limitation by reducing less …