Deep learning in human activity recognition with wearable sensors: A review on advances

S Zhang, Y Li, S Zhang, F Shahabi, S Xia, Y Deng… - Sensors, 2022 - mdpi.com
Mobile and wearable devices have enabled numerous applications, including activity
tracking, wellness monitoring, and human–computer interaction, that measure and improve …

Deep multi-view learning methods: A review

X Yan, S Hu, Y Mao, Y Ye, H Yu - Neurocomputing, 2021 - Elsevier
Multi-view learning (MVL) has attracted increasing attention and achieved great practical
success by exploiting complementary information of multiple features or modalities …

MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation

L Ruan, Y Ma, H Yang, H He, B Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose the first joint audio-video generation framework that brings engaging watching
and listening experiences simultaneously, towards high-quality realistic videos. To generate …

Diffsound: Discrete diffusion model for text-to-sound generation

D Yang, J Yu, H Wang, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …

Everything at once – multi-modal fusion transformer for video retrieval

N Shvetsova, B Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks …

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

The sound of pixels

H Zhao, C Gan, A Rouditchenko… - Proceedings of the …, 2018 - openaccess.thecvf.com
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos,
learns to locate image regions which produce sounds and separate the input sounds into a …

Listen to look: Action recognition by previewing audio

R Gao, TH Oh, K Grauman… - Proceedings of the …, 2020 - openaccess.thecvf.com
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly
impractical. We propose a framework for efficient action recognition in untrimmed video that …

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Dancing to music

HY Lee, X Yang, MY Liu, TC Wang… - Advances in neural …, 2019 - proceedings.neurips.cc
Dancing to music is an instinctive move by humans. Learning to model the music-to-dance
generation process is, however, a challenging problem. It requires significant efforts to …