Google Proposes VLOGGER to Generate Realistic Videos of Talking and Moving Humans

VLOGGER generates photorealistic videos of humans, including facial and body movements, from audio or text input combined with a single image of the person. It uses a stochastic diffusion model together with a 3D human pose representation. To support VLOGGER training, the authors introduce MENTOR, a new large-scale, diverse dataset with 3D pose and expression annotations, making it the largest such dataset in terms of number of identities and temporal length. VLOGGER outperforms state-of-the-art methods on multiple public benchmarks, showing advantages in image quality, identity preservation, and temporal consistency, and its robustness is validated across different diversity dimensions. (AI Mythic Room)
