Aug. 27 - Yesterday evening, Alibaba Cloud announced the open-sourcing of a new multimodal video generation model, Tongyi Wanxiang Wan2.2-S2V. Given just one still image and one audio clip, it can generate film-quality digital-human videos with natural facial expressions, accurate lip sync, and smooth body movement.

According to the announcement, a single generation pass can produce video minutes in length, significantly improving the efficiency of video creation in industries such as digital-human livestreaming, film and television production, and AI education.
Currently, Wan2.2-S2V can animate images of real people, cartoon characters, animals, digital humans, and other subjects, and supports arbitrary framings such as portrait, half-body, and full-body. After an audio clip is uploaded, the model makes the subject in the image talk, sing, or perform.
Wan2.2-S2V also supports text control: entering a prompt lets you steer the generated frames, enabling richer subject motion and changes to the background.
For example, given a photo of a character playing the piano, a song, and a text prompt, Wan2.2-S2V can generate a complete piano performance video with full audio: the character stays consistent with the original image, facial expressions and mouth movements are aligned with the audio, and the character's finger positions, force, and speed match the rhythm of the music. A conceptual sketch of this workflow follows.
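The sketch below only illustrates the image + audio + prompt contract described above. `S2VPipeline` and its `generate` method are hypothetical names used for illustration; they are not the documented interface of the Wan2.2 repository.

```python
# Conceptual sketch of the speech-to-video (S2V) workflow: one reference image,
# one driving audio clip, one optional text prompt -> one video file.
# The pipeline type and its method are hypothetical placeholders, not Wan2.2's API.
from typing import Protocol


class S2VPipeline(Protocol):
    def generate(self, reference_image: str, driving_audio: str, prompt: str) -> bytes: ...


def make_performance_video(pipeline: S2VPipeline,
                           image_path: str, audio_path: str, prompt: str,
                           out_path: str = "performance.mp4") -> str:
    """Drive one still image with one audio clip, optionally steered by a text prompt."""
    video_bytes = pipeline.generate(
        reference_image=image_path,  # e.g. a photo of a pianist
        driving_audio=audio_path,    # the song that controls lips, expression, and rhythm
        prompt=prompt,               # text guiding global motion and background changes
    )
    with open(out_path, "wb") as f:
        f.write(video_bytes)
    return out_path
```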

According to the introduction, Wan2.2-S2V builds on the video generation capability of the Tongyi Wanxiang base model, integrating text-guided global motion control with audio-driven fine-grained local motion to achieve audio-driven video generation in complex scenes. It also introduces two control mechanisms, AdaIN and cross-attention, for more accurate and more expressive audio control. To maintain quality over long videos, Wan2.2-S2V uses hierarchical frame compression to greatly reduce the number of tokens spent on history frames, expanding the length of the motion frames (IT House note: history reference frames) from a handful of frames to 73 frames and thereby achieving stable long-video generation.
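A minimal PyTorch sketch of the two audio-control mechanisms named above (AdaIN and cross-attention), plus a toy version of compressing history-frame tokens. All module shapes, names, and strides here are assumptions for illustration; they are not Wan2.2-S2V's actual implementation.

```python
import torch
import torch.nn as nn


class AudioAdaIN(nn.Module):
    """Adaptive-instance-norm-style control: a pooled audio feature predicts
    a per-channel scale and shift applied to the video latent tokens."""
    def __init__(self, dim: int, audio_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(audio_dim, 2 * dim)

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) video latents; audio: (batch, audio_dim)
        scale, shift = self.to_scale_shift(audio).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class AudioCrossAttention(nn.Module):
    """Cross-attention control: video tokens attend to per-frame audio tokens."""
    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim, vdim=audio_dim,
                                          batch_first=True)

    def forward(self, x: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); audio_tokens: (batch, audio_len, audio_dim)
        out, _ = self.attn(query=x, key=audio_tokens, value=audio_tokens)
        return x + out


def compress_history(frames: torch.Tensor, strides=(4, 2, 1)) -> torch.Tensor:
    """Toy 'hierarchical' compression of history frames: older chunks are kept
    sparsely (large stride), the most recent chunk densely (stride 1), cutting
    the token budget while keeping a long reference window.
    frames: (history_len, tokens, dim), ordered oldest to newest."""
    chunks = torch.chunk(frames, len(strides), dim=0)
    kept = [chunk[::stride] for chunk, stride in zip(chunks, strides)]
    return torch.cat(kept, dim=0)
```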
For training, the Tongyi team constructed an audio-video dataset of more than 600,000 clips and carried out full-parameter training with hybrid parallelism to fully exploit the model's capacity. Multi-resolution training also gives the model multi-resolution inference, covering video generation needs across different resolution scenarios such as vertical short videos and horizontal film and TV footage.
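The announcement does not describe the exact multi-resolution recipe; a common way to implement it is resolution bucketing, sketched below. The bucket list and shapes are illustrative assumptions, not Wan2.2's configuration.

```python
# Resolution bucketing: group clips so every batch has a uniform shape, letting one
# model train on vertical, horizontal, and square aspect ratios.
from collections import defaultdict

BUCKETS = [(832, 480), (480, 832), (624, 624)]  # (height, width) examples only


def nearest_bucket(height: int, width: int) -> tuple:
    """Pick the bucket whose aspect ratio is closest to the clip's."""
    ratio = height / width
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))


def bucket_clips(clips):
    """clips: iterable of (clip_id, height, width) -> {bucket: [clip ids]}."""
    groups = defaultdict(list)
    for clip_id, h, w in clips:
        groups[nearest_bucket(h, w)].append(clip_id)
    return groups
```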
Benchmark results show that Wan2.2-S2V achieves the best results among comparable models on core metrics such as FID (video quality, lower is better), EFID (expression fidelity, lower is better), and CSIM (identity consistency, higher is better).
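For reference, an identity-consistency score like CSIM is typically computed as the cosine similarity between face-identity embeddings of the reference image and of the generated frames. The snippet below shows that general idea; the embedding model is assumed (e.g. an ArcFace-style encoder), and the exact metric definition used in the Wan2.2-S2V evaluation may differ.

```python
import numpy as np


def csim(reference_embedding: np.ndarray, frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between the reference identity embedding (D,)
    and per-frame embeddings (N, D); higher means better identity consistency."""
    ref = reference_embedding / np.linalg.norm(reference_embedding)
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return float(np.mean(frames @ ref))
```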
Alibaba Cloud said that since February this year, Tongyi Wanxiang has continuously open-sourced models for text-to-video, image-to-video, first-and-last-frame-to-video, all-purpose editing, audio-driven video, and more, with downloads across the open-source community and third-party platforms exceeding 20 million.
Open source address:
- GitHub: https://github.com/Wan-Video/Wan2.2
- ModelScope community: https://www.modelscope.cn/models/Wan-AI/Wan2.2-S2V-14B
- Hugging Face: https://huggingface.co/Wan-AI/Wan2.2-S2V-14B
Experience address:
- Tongyi Wanxiang official website: https://tongyi.aliyun.com/wanxiang/generate
- Alibaba Cloud Bailian: https://bailian.console.aliyun.com/?tab=api#/api/?type=model&url=2978215