On April 17, Alibaba announced the open-sourcing of the Tongyi Wanxiang first-and-last-frame video generation model. The model has 14B parameters and is claimed to be the industry's first open-source first-and-last-frame video model at the 10-billion-parameter scale.

It generates a 720p HD video that connects a start frame and an end frame specified by the user. This upgrade is meant to meet users' need for more controllable and customized video generation.
Users can try the model for free on the Tongyi Wanxiang official website, or download it from GitHub, Hugging Face, or the ModelScope community for local deployment and secondary development.
Technical Introduction
First-and-last-frame video generation is more controllable than text-to-video and single-image-to-video generation, but training such a model is harder, because the generated video must satisfy all of the following at once:
1. The content is consistent with the two images provided by the user
2. The video follows the user's text prompt
3. The video transitions naturally and smoothly from the given first frame to the last frame
4. The video as a whole is coherent and natural
Training and inference optimization
Building on the existing Wan2.1 text-to-video base model architecture, the Tongyi Wanxiang first-and-last-frame video model introduces an additional conditioning mechanism through which smooth and accurate transitions between the first and last frames can be realized.
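The announcement does not detail this mechanism. As one plausible illustration, a latent video diffusion model can be conditioned on boundary frames by concatenating their encoded latents and a visibility mask onto the noisy latent fed to the diffusion transformer. The sketch below shows that idea in PyTorch; the function names and tensor shapes are assumptions, not the published Wan2.1-FLF2V implementation.

```python
import torch

# Minimal, illustrative sketch of boundary-frame conditioning for a latent
# video diffusion model (an assumption about how such conditioning can be
# wired up, not the actual Wan2.1-FLF2V code).
def build_condition(first_latent: torch.Tensor, last_latent: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Pack the two given frame latents plus a visibility mask; latents are (C, H, W)."""
    C, H, W = first_latent.shape
    cond = torch.zeros(num_frames, C, H, W)   # zeros for the unknown middle frames
    mask = torch.zeros(num_frames, 1, H, W)   # 1 marks frames the user supplied
    cond[0], mask[0] = first_latent, 1.0
    cond[-1], mask[-1] = last_latent, 1.0
    return torch.cat([cond, mask], dim=1)     # (num_frames, C + 1, H, W)

def dit_input(noisy_latent: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
    # Concatenate along channels so every denoising step sees the fixed
    # first and last frames alongside the noisy latent it is refining.
    return torch.cat([noisy_latent, condition], dim=1)
```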
In the training phase, the team also built training data specifically for the first-and-last-frame mode, and applied parallel strategies to both the text/video encoding modules and the diffusion transformer module. This improves the efficiency of training and generation while preserving the model's high-resolution video generation quality.
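The announcement does not say which parallel strategies were used. One common choice for a model of this size is to shard the large diffusion transformer across GPUs while the frozen encoders are simply replicated. The sketch below illustrates that option with placeholder module names, assuming a torch.distributed process group is already initialized; it is not necessarily the strategy used for Wan2.1-FLF2V.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Illustrative sketch only: "dit", "text_encoder", and "video_vae" are
# placeholder modules, and FSDP sharding is one possible parallel strategy.
def parallelize(dit: torch.nn.Module,
                text_encoder: torch.nn.Module,
                video_vae: torch.nn.Module,
                device: torch.device):
    # Frozen encoders: no gradients, so plain per-rank replication is enough.
    text_encoder.requires_grad_(False)
    video_vae.requires_grad_(False)
    text_encoder.to(device)
    video_vae.to(device)
    # Shard the diffusion transformer's parameters, gradients, and optimizer
    # state across ranks so a 14B-parameter model fits in GPU memory.
    dit = FSDP(dit.to(device))
    return dit, text_encoder, video_vae
```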
In the inference stage, to support HD video generation under limited GPU memory, the Wanxiang first-and-last-frame model adopts a model slicing strategy and a sequence parallelism strategy, which significantly shorten inference time while ensuring that output quality is not compromised.
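Neither strategy is spelled out in the announcement. As a rough illustration of sequence parallelism, each GPU can hold a slice of the video token sequence and gather the other ranks' keys and values so attention still spans the full sequence. The function below is a simplified sketch under that assumption (even split, initialized torch.distributed group), not the actual inference code.

```python
import torch
import torch.distributed as dist

def sharded_attention(q_local: torch.Tensor,
                      k_local: torch.Tensor,
                      v_local: torch.Tensor) -> torch.Tensor:
    """One attention layer under sequence parallelism (illustrative sketch).

    Each rank owns a contiguous slice of the token sequence, shaped
    (batch, local_seq, dim). Keys/values are all-gathered so every query
    can attend over the full sequence while activations stay sharded.
    """
    world = dist.get_world_size()
    k_parts = [torch.empty_like(k_local) for _ in range(world)]
    v_parts = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_parts, k_local)
    dist.all_gather(v_parts, v_local)
    k_full = torch.cat(k_parts, dim=1)   # (batch, full_seq, dim)
    v_full = torch.cat(v_parts, dim=1)
    # Standard attention, but only for this rank's slice of queries.
    return torch.nn.functional.scaled_dot_product_attention(q_local, k_full, v_full)
```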
Functionality Upgrade
Based on this model, users can complete more complex and more personalized video generation tasks, such as applying special-effect transformations to the same subject or using camera-movement control across different scenes.
For example, a user can upload two pictures of the same location taken at different times and enter a prompt; the Tongyi Wanxiang first-and-last-frame model then generates a time-lapse-style video of changing seasons or a day-to-night transition. A user can also upload two images of different scenes and bridge them with camera moves such as rotating, panning, and pushing in, keeping the video consistent with the preset frames while giving the footage richer motion.
Open-source addresses:
- HuggingFace: https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P
- ModelScope: https://www.modelscope.cn/models/Wan-AI/Wan2.1-FLF2V-14B-720P
- Direct experience: https://tongyi.aliyun.com/wanxiang/videoCreation
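For local deployment, the weights can be pulled straight from the Hugging Face repository above. Below is a minimal download sketch using huggingface_hub; the actual inference commands and entry points come from the model's own documentation.

```python
# Minimal download sketch; assumes the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-FLF2V-14B-720P",   # repository linked above
    local_dir="./Wan2.1-FLF2V-14B-720P",      # 14B params, so expect tens of GB on disk
)
print("Model files downloaded to:", local_dir)
```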