Training objective drives the consistency of representational similarity across datasets

L Ciernik, L Linhardt, M Morik, J Dippel… - arXiv preprint arXiv …, 2024 - arxiv.org
The Platonic Representation Hypothesis claims that recent foundation models are
converging to a shared representation space as a function of their downstream task …

Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

OC Phukan, S Jain, SR Behera, AB Buduru… - arXiv preprint arXiv …, 2024 - arxiv.org
In this study, for the first time, we extensively investigate whether music foundation models
(MFMs) or speech foundation models (SFMs) work better for singing voice deepfake …

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

M Maniparambil, R Akshulakov, YAD Djilali… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust
open-world semantic understanding, becoming the standard image backbones for vision …