{"id":41837,"date":"2025-08-27T12:21:41","date_gmt":"2025-08-27T04:21:41","guid":{"rendered":"https:\/\/www.1ai.net\/?p=41837"},"modified":"2025-08-27T12:21:51","modified_gmt":"2025-08-27T04:21:51","slug":"%e4%b8%80%e5%bc%a0%e5%9b%be%e5%8d%b3%e5%8f%af%e7%94%9f%e6%88%90%e7%94%b5%e5%bd%b1%e7%ba%a7%e6%95%b0%e5%ad%97%e4%ba%ba%e8%a7%86%e9%a2%91%ef%bc%9a%e9%98%bf%e9%87%8c%e4%ba%91%e9%80%9a%e4%b9%89%e4%b8%87","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/41837.html","title":{"rendered":"A single image can generate cinematic digital human video: Alibaba Cloud open-sources the Tongyi Wanxiang Wan2.2-S2V video generation model"},"content":{"rendered":"<p>August 27 news: yesterday evening, <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%98%bf%e9%87%8c%e4%ba%91\" title=\"_Other Organiser\" target=\"_blank\" >Alibaba Cloud<\/a> announced the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open source<\/a> release of a new multimodal <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%a7%86%e9%a2%91%e7%94%9f%e6%88%90%e6%a8%a1%e5%9e%8b\" title=\"_Other Organiser\" target=\"_blank\" >video generation model<\/a>, <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%80%9a%e4%b9%89%e4%b8%87%e7%9b%b8\" title=\"_Other Organiser\" target=\"_blank\" >Tongyi Wanxiang<\/a> Wan2.2-S2V. From just one still image and one audio clip, it can generate a cinematic <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%95%b0%e5%ad%97%e4%ba%ba\" title=\"[View articles tagged with [digital people]]\" target=\"_blank\" >digital human<\/a> video with natural facial expressions, accurate lip sync, and smooth body movement.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-41838\" title=\"fb04c776j00t1mxec008fd000v900fpp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/08\/fb04c776j00t1mxec008fd000v900fpp.jpg\" alt=\"fb04c776j00t1mxec008fd000v900fpp\" width=\"1125\" height=\"565\" \/><\/p>\n<p>According to 
reports, the model can generate minute-level video in a single pass, significantly improving the efficiency of video creation in industries such as digital human livestreaming, film and television production, and AI education.<\/p>\n<p>Wan2.2-S2V can currently animate many types of images, including real people, cartoons, animals, and digital humans, and it supports arbitrary framing such as portrait, half-body, and full-body shots. Once a piece of audio is uploaded, the model makes the subject in the picture talk, sing, or perform.<\/p>\n<p>Wan2.2-S2V also supports text control: entering a prompt lets you direct the video frame, enabling richer subject motion and background changes.<\/p>\n<p>For example, given a photo of a character at a piano, a song, and a short text prompt, Wan2.2-S2V can generate a complete piano performance video with sound: the character stays consistent with the original picture, facial expressions and mouth movements are aligned with the audio, and the character's finger positions, force, and speed match the rhythm of the music.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-41839\" title=\"aa102483j00t1mxf400ipd000v900jpp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/08\/aa102483j00t1mxf400ipd000v900jpp.jpg\" alt=\"aa102483j00t1mxf400ipd000v900jpp\" width=\"1125\" height=\"709\" \/><\/p>\n<p>According to the introduction, Wan2.2-S2V builds on the video generation capability of the Tongyi Wanxiang base model, integrating text-guided global motion control with audio-driven fine-grained local motion to achieve audio-driven video generation in complex scenes. It also introduces two control mechanisms, AdaIN and cross-attention, for more accurate and more dynamic audio control; 
and to guarantee the quality of long video generation, Wan2.2-S2V uses hierarchical frame compression to sharply cut the number of tokens spent on history frames, extending the length of the motion frames (IT House note: history reference frames) from a few frames to 73 frames and thus achieving stable long-video generation.<\/p>\n<p>For model training, the Tongyi team built an audio-video dataset of more than 600,000 clips and carried out fully parameterized training with hybrid parallelism to fully exploit the model's capacity. Multi-resolution training also gives the model multi-resolution inference, covering video generation needs at different resolutions, from vertical short videos to horizontal film and TV footage.<\/p>\n<p>Benchmark results show that Wan2.2-S2V achieves the best scores among comparable models on core metrics such as FID (video quality; lower is better), EFID (expression fidelity; lower is better), and CSIM (identity consistency; higher is better).<\/p>\n<p>Alibaba Cloud said that since February this year, Tongyi Wanxiang has continuously open-sourced models for text-to-video, image-to-video, first-and-last-frame video generation, all-round editing, and audio-to-video, with more than 20 million downloads across the open source community and third-party platforms.<\/p>\n<p>Open source addresses:<\/p>\n<ul>\n<li>GitHub: https:\/\/github.com\/Wan-Video\/Wan2.2<\/li>\n<li>ModelScope Community: https:\/\/www.modelscope.cn\/models\/Wan-AI\/Wan2.2-S2V-14B<\/li>\n<li>HuggingFace: https:\/\/huggingface.co\/Wan-AI\/Wan2.2-S2V-14B<\/li>\n<\/ul>\n<p><strong>Experience 
addresses:<\/strong><\/p>\n<ul>\n<li>Tongyi Wanxiang official website: https:\/\/tongyi.aliyun.com\/wanxiang\/generate<\/li>\n<li>Alibaba Cloud Bailian: https:\/\/bailian.console.aliyun.com\/?tab=api#\/api\/?type=model&amp;url=2978215<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>August 27 news: yesterday evening, Alibaba Cloud announced the open source release of a new multimodal video generation model, Tongyi Wanxiang Wan2.2-S2V. From just one still image and one audio clip, it can generate a cinematic digital human video with natural facial expressions, accurate lip sync, and smooth body movement. According to reports, the model can generate minute-level video in a single pass, significantly improving the efficiency of video creation in industries such as digital human livestreaming, film and television production, and AI education. Wan2.2-S2V can currently animate many types of images, including real people, cartoons, animals, and digital humans, and it supports arbitrary framing such as portrait, half-body, and full-body shots. Once a piece of audio is uploaded, the model makes the subject in the picture talk, sing, or perform. 
Wan2.2-S2V also supports text control: entering a prompt lets you direct the video frame.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[219,1252,460,621,334],"collection":[],"class_list":["post-41837","post","type-post","status-publish","format-standard","hentry","category-news","tag-219","tag-1252","tag-460","tag-621","tag-334"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41837","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=41837"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41837\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=41837"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=41837"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=41837"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=41837"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}