A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation

L Liu, L Gao, W Lei, F Ma, X Lin, J Wang - arXiv preprint arXiv:2308.08849, 2023 - arxiv.org
Body language (BL) refers to the non-verbal communication expressed through physical
movements, gestures, facial expressions, and postures. It is a form of communication that …

MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

J Liu, X Wang, X Fu, Y Chai, C Yu, J Dai… - Proceedings of the 31st …, 2023 - dl.acm.org
Face-to-face communication is a common scenario including roles of speakers and
listeners. Most existing research methods focus on producing speaker videos, while the …

OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions

J Liu, X Wang, X Fu, Y Chai, C Yu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
One-shot talking head generation has no explicit head movement reference, thus it is difficult
to generate talking heads with head motions. Some existing works only edit the mouth area …

A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

M Meng, Y Zhao, B Zhang, Y Zhu, W Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Talking head synthesis, an advanced method for generating portrait videos from a still image
driven by specific content, has garnered widespread attention in virtual reality, augmented …

ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers

M Liu, J Wang, X Qian, H Li - ACM Multimedia 2024 - openreview.net
As one of the crucial elements in human-robot interaction, responsive listening head
generation has attracted considerable attention from researchers. It aims to generate a …

Expformer: Audio-Driven One-Shot Talking Face Generation Based On Expression Transformer

K Liu, X Yi, X Zhao - Available at SSRN 4698118, 2024 - papers.ssrn.com
Audio-driven one-shot talking face generation is challenging due to the semantic gap
between audio and visual representations. Existing methods inadequately tackle the …