
LatentSync is an end-to-end lip-synchronization framework jointly launched by ByteDance and Beijing Jiaotong University. Built on audio-driven latent diffusion models, it aims to generate high-quality, realistic talking videos with strong temporal consistency. The framework suits a wide range of application scenarios, such as voice-over, virtual avatars, and game development.
LatentSync Features
- End-to-End Lip Synchronization: LatentSync models the complex audio-video relationship directly in latent space, without any intermediate motion representation, and generates lip movements that precisely match the input audio (see the conditioning sketch after this list).
- High-Resolution Video Generation: by diffusing in latent space rather than pixel space, LatentSync avoids the heavy hardware demands of traditional pixel-space diffusion models and can generate high-resolution video.
- Dynamic, Realistic Results: the generated video captures subtle expressions tied to the emotional tone of the speech, making the character's delivery more natural and vivid.
- Temporal Consistency Enhancement: LatentSync introduces Temporal REPresentation Alignment (TREPA), which extracts temporal representations with a large-scale self-supervised video model and aligns generated frames with real frames, reducing flicker and making playback smoother (a minimal sketch follows this list).
- Multi-Language Support: LatentSync handles multiple languages, making it suitable for international content localization.
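
To make the end-to-end claim concrete: one common way to inject audio conditioning into a latent diffusion backbone is cross-attention from latent tokens to audio embeddings. The PyTorch sketch below is illustrative only; the module name, the dimensions, and the use of `nn.MultiheadAttention` are assumptions, not LatentSync's actual code.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Video latent tokens attend to per-frame audio embeddings.

    A generic conditioning pattern consistent with "modeling the
    audio-video relationship in latent space"; not LatentSync's API.
    """
    def __init__(self, latent_dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )

    def forward(self, latents: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) flattened latent tokens for one clip
        # audio:   (B, T, audio_dim) audio features from a speech encoder
        attended, _ = self.attn(self.norm(latents), audio, audio)
        return latents + attended  # residual connection, as in standard attention blocks

# Shape check with toy tensors:
block = AudioCrossAttention(latent_dim=320, audio_dim=384)
out = block(torch.randn(2, 64, 320), torch.randn(2, 50, 384))
print(out.shape)  # torch.Size([2, 64, 320])
```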
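
And a sketch of the TREPA idea: extract temporal representations of generated and real frame sequences with a frozen self-supervised video model, then penalize their distance. The dummy encoder and the MSE distance here are stand-ins; the source only specifies that a large-scale self-supervised video model provides the temporal representations.

```python
import torch
import torch.nn.functional as F

def trepa_loss(video_encoder, generated, real):
    """TREPA-style loss: align temporal representations of generated
    and real frame sequences. Tensors are (B, C, T, H, W)."""
    with torch.no_grad():
        target = video_encoder(real)    # real frames give a fixed target
    pred = video_encoder(generated)     # gradients flow back to the generator
    return F.mse_loss(pred, target)     # the distance choice is an assumption

# Toy stand-in for the self-supervised video model (kept frozen):
encoder = torch.nn.Sequential(
    torch.nn.Flatten(),                     # (B, C*T*H*W)
    torch.nn.Linear(3 * 8 * 16 * 16, 128),  # (B, 128) "temporal representation"
)
for p in encoder.parameters():
    p.requires_grad_(False)

generated = torch.randn(2, 3, 8, 16, 16, requires_grad=True)
real = torch.randn(2, 3, 8, 16, 16)
loss = trepa_loss(encoder, generated, real)
loss.backward()                             # grads reach the generated frames
```

In training, a term like this would presumably be added to the main diffusion objective with a weighting coefficient, so that temporal alignment supplements rather than replaces the denoising loss.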
Official website link: https://www.latentsync.org