Self-supervised video forensics by audio-visual anomaly detection

C Feng, Z Chen, A Owens - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Manipulated videos often contain subtle inconsistencies between their visual and audio
signals. We propose a video forensics method, based on anomaly detection, that can …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Audio-synchronized visual animation

L Zhang, S Mo, Y Zhang, P Morgado - European Conference on Computer …, 2025 - Springer
Current visual generation methods can produce high-quality videos guided by text prompts.
However, effectively controlling object dynamics remains a challenge. This work explores …

Audio-visual segmentation with semantics

J Zhou, X Shen, J Wang, J Zhang, W Sun… - International Journal of …, 2024 - Springer
We propose a new problem called audio-visual segmentation (AVS), in which the goal is to
output a pixel-level map of the object(s) that produce sound at the time of the image frame …

Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2025 - Springer
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds

Y Zhang, Y Gu, Y Zeng, Z Xing, Y Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing
with videos, enabling an immersive audio-visual experience. Despite its wide range of …

Reading to listen at the cocktail party: Multi-modal speech separation

A Rahimi, T Afouras… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The goal of this paper is speech separation and enhancement in multi-speaker and noisy
environments using a combination of different modalities. Previous works have shown good …

Self-supervised audio-visual soundscape stylization

T Li, R Wang, PY Huang, A Owens… - … on Computer Vision, 2025 - Springer
Speech sounds convey a great deal of information about the scenes, resulting in a variety of
effects ranging from reverberation to additional ambient sounds. In this paper, we …

Vocalist: An audio-visual synchronisation model for lips and voices

VS Kadandale, JF Montesinos, G Haro - arXiv preprint arXiv:2204.02090, 2022 - arxiv.org
In this paper, we address the problem of lip-voice synchronisation in videos containing
human face and voice. Our approach is based on determining if the lips motion and the …

Sparse in space and time: Audio-visual synchronisation with trainable selectors

V Iashin, W Xie, E Rahtu, A Zisserman - arXiv preprint arXiv:2210.07055, 2022 - arxiv.org
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For
such videos, the events that may be harnessed for synchronisation cues may be spatially …