Improving pixel-based mim by reducing wasted modeling capability

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org

As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

Cited by 7 · Related articles · All 2 versions

[PDF] ecva.net

[PDF] Visionllama: A unified llama backbone for vision tasks

X Chu, J Su, B Zhang, C Shen - European Conference on Computer …, 2024 - ecva.net

We all know that large language models are built on top of a transformer-based architecture
to process textual inputs. For example, the LLaMA family of models stands out among many …

Cited by 2 · Related articles

[PDF] thecvf.com

Multimodal pathway: Improve transformers with irrelevant data from other modalities

Y Zhang, X Ding, K Gong, Y Ge… - Proceedings of the …, 2024 - openaccess.thecvf.com

We propose to improve transformers of a specific modality with irrelevant data from other
modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like …

Cited by 2 · Related articles · All 5 versions

[PDF] arxiv.org

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …

Cited by 9 · Related articles · All 3 versions

[PDF] arxiv.org

Pixmim: Rethinking pixel reconstruction in masked image modeling

Y Liu, S Zhang, J Chen, K Chen, D Lin - arXiv preprint arXiv:2303.02416, 2023 - arxiv.org

Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked
Autoencoders (MAE) and BEiT. However, subsequent works have complicated the …

Cited by 19 · Related articles · All 3 versions

[PDF] arxiv.org

T3d: Towards 3d medical image understanding through vision-language pre-training

C Liu, C Ouyang, Y Chen, CC Quilodrán-Casas… - arXiv preprint arXiv …, 2023 - arxiv.org

Expert annotation of 3D medical image for downstream analysis is resource-intensive,
posing challenges in clinical applications. Visual self-supervised learning (vSSL), though …

Cited by 9 · Related articles · All 2 versions

[PDF] thecvf.com

VideoMAC: Video Masked Autoencoders Meet ConvNets

G Pei, T Chen, X Jiang, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently the advancement of self-supervised learning techniques like masked
autoencoders (MAE) has greatly influenced visual representation learning for images and …

Cited by 5 · Related articles · All 3 versions

[PDF] arxiv.org

Zero-shot ecg classification with multimodal learning and test-time clinical knowledge enhancement

C Liu, Z Wan, C Ouyang, A Shah, W Bai… - arXiv preprint arXiv …, 2024 - arxiv.org

Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac
arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) …

Cited by 9 · Related articles · All 3 versions

[PDF] arxiv.org

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

X Chu, J Su, B Zhang, C Shen - arXiv preprint arXiv:2403.00522, 2024 - arxiv.org

Large language models are built on top of a transformer-based architecture to process
textual inputs. For example, the LLaMA stands out among many open-source …

Cited by 5 · Related articles · All 2 versions

[PDF] thecvf.com

Learning to Rank Patches for Unbiased Image Redundancy Reduction

Y Luo, Z Chen, P Zhou, Z Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Images suffer from heavy spatial redundancy because pixels in neighboring regions are
spatially correlated. Existing approaches strive to overcome this limitation by reducing less …
