{"id":2982,"date":"2024-01-18T10:02:13","date_gmt":"2024-01-18T02:02:13","guid":{"rendered":"https:\/\/www.1ai.net\/?p=2982"},"modified":"2024-01-18T10:05:12","modified_gmt":"2024-01-18T02:05:12","slug":"%e4%b8%80%e6%96%87%e4%ba%86%e8%a7%a3%e7%94%9f%e6%88%90%e5%bc%8fai%e8%a7%86%e9%a2%91","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/2982.html","title":{"rendered":"Learn about generative AI video in one article"},"content":{"rendered":"<p>Last year was the year of AI video explosion. In January 2023, there is no public text-to-video model. So far,<a href=\"https:\/\/www.1ai.net\/en\/tag\/ai%e8%a7%86%e9%a2%91\" title=\"[View articles tagged with [AI Video]]\" target=\"_blank\" >AI Video<\/a>There are dozens of generated products with millions of users. Let&#039;s review the development of AI-generated videos and noteworthy technologies and applications in the past year. It mainly includes the following aspects:<\/p>\n<ul class=\"list-paddingleft-1\">\n<li>Current AI video classification<\/li>\n<li><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%94%9f%e6%88%90%e5%bc%8fai%e8%a7%86%e9%a2%91\" title=\"[SEE ARTICLES WITH [GENERATED AI VIDEO] LABELS]\" target=\"_blank\" >Generative AI Video<\/a>technology<\/li>\n<li>AI video extension technology and applications<\/li>\n<li>Generative AI Video Outlook<\/li>\n<li>\n<section>Challenges of Generative AI Video<\/section>\n<\/li>\n<\/ul>\n<h2>AI Video Classification<\/h2>\n<section>AI videos can basically be divided into the following four categories:<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2983\" title=\"640-79\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-79.png\" alt=\"640-79\" width=\"1080\" height=\"556\" \/><\/section>\n<h3>1. 
Text\/Picture Generated Video<\/h3>\n<section>As the name suggests, you can generate the corresponding video by entering a text description or uploading a picture.<\/section>\n<section>Common products such as Runway, Pika, NeverEnds, PixVerse, and SVD (Stable Video Diffusion) belong to this category.<\/section>\n<section>For example, Runway&#039;s cinematic style<\/section>\n<section>Pika&#039;s anime style<\/section>\n<section>NeverEnds&#039; portrait model<\/section>\n<section>There are also extended applications, such as Alibaba\u2019s recently popular \u201cKing of Dance\u201d, which is built on a diffusion model combined with other techniques such as ControlNet; more on this later.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2984\" title=\"640-8\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-8.gif\" alt=\"640-8\" width=\"640\" height=\"492\" \/><\/section>\n<h3>2. Video to video generation<\/h3>\n<section>It is usually divided into style transfer, in-video content replacement, partial redrawing, and AI video upscaling.<\/section>\n<section><strong>Such as WonderStudio&#039;s character CG replacement:<\/strong><\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2985\" title=\"640-9\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-9.gif\" alt=\"640-9\" width=\"1079\" height=\"461\" \/><\/section>\n<section><strong>DomoAI&#039;s video style transfer<\/strong><\/section>\n<section>The technologies involved include: frame-by-frame video generation with ControlNet processing, style-transfer LoRA, video upscaling, face restoration, etc.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2986\" title=\"640-10\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-10.gif\" alt=\"640-10\" width=\"640\" height=\"521\" \/><\/section>\n<section><strong>Video Face Swap<\/strong><\/section>\n<section>Common ones include Faceswap, DeepFaceLab, etc. The technologies involved include: face detection, feature extraction, face conversion, optimization, etc.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2988\" title=\"640-12\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-12.gif\" alt=\"640-12\" width=\"486\" height=\"482\" \/><\/section>\n<h3>3. Digital Humans<\/h3>\n<section>Represented by HeyGen and D-ID, this is achieved through a combination of face detection, voice-cloning TTS, and lip-sync technology.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2987\" title=\"640-11\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-11.gif\" alt=\"640-11\" width=\"640\" height=\"1003\" \/><\/section>\n<h3>4. Video Editing Type<\/h3>\n<section><strong>Material Matching<\/strong><\/section>\n<section>Given a theme or brief, it searches existing footage and stitches it into a finished video. The most commonly used editing tool is Jianying (CapCut), which can search online footage to match your script.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2989\" title=\"640-13\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-13.gif\" alt=\"640-13\" width=\"1079\" height=\"594\" \/><\/section>\n<section><strong>Key part clipping<\/strong><\/section>\n<section>Converting long videos into the short clips you need, well suited to talk shows. 
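The long-to-short clipping step can be sketched at the segment-selection level. This is a toy sketch under assumptions: the per-second interest scores below are made-up values standing in for whatever an analysis model would produce, and the actual cutting (for example with an editing library such as MoviePy) is not shown.

```python
# Toy sketch: choose "key segments" of a long video from per-second
# interest scores (hypothetical values; a real pipeline would derive
# them from scene/audio/face analysis).

def key_segments(scores, threshold=0.6, min_gap=1):
    """Return merged (start, end) second-ranges where score >= threshold;
    runs separated by at most `min_gap` seconds are merged."""
    segments = []
    for t, s in enumerate(scores):
        if s < threshold:
            continue
        if segments and t - segments[-1][1] <= min_gap:
            segments[-1][1] = t + 1      # close enough: extend previous segment
        else:
            segments.append([t, t + 1])  # start a new segment
    return [tuple(seg) for seg in segments]

scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.7, 0.9, 0.2, 0.1]
print(key_segments(scores))  # → [(2, 4), (6, 8)]
```

Each (start, end) pair could then be cut out and concatenated into the finished short video.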
The technologies involved may include using OpenCV and TensorFlow to analyze the video content and identify key segments, then using MoviePy to cut and assemble those segments into a short video.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2990\" title=\"640-14\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-14.gif\" alt=\"640-14\" width=\"640\" height=\"426\" \/><\/section>\n<section><strong>Video HD<\/strong><\/section>\n<section>Video quality is improved through super-resolution, noise reduction, frame interpolation, and similar techniques.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2991\" title=\"640-81\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-81.png\" alt=\"640-81\" width=\"995\" height=\"552\" \/><\/section>\n<h2>Generative AI Video Technology<\/h2>\n<section>As you can see, the applications of AI video are varied, but the underlying technologies come down to three: the GAN, the Diffusion Model, and the Transformer architecture that has been so popular in the field of large models over the past two years.<\/section>\n<section>There are also the Variational Autoencoder (VAE) and the foundational DDPM (Denoising Diffusion Probabilistic Model); we will not go into those here, and will mainly introduce the first three in plain language.<\/section>\n<section><strong>1. Generative Adversarial Networks (GANs)<\/strong><\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2992\" title=\"640-80\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-80.png\" alt=\"640-80\" width=\"596\" height=\"204\" \/><\/section>\n<section>As the name implies, a GAN consists of a generator and a discriminator. 
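As a deliberately tiny sketch of this adversarial setup, the two competing objectives can be written down directly. The probabilities below are made-up stand-ins for discriminator outputs, not results from a real model, and there are no networks or training loop here.

```python
import numpy as np

# Toy sketch of the GAN objectives only (no networks, no training).
# d_real / d_fake stand in for the discriminator's probability that
# real images and generator samples are real; the values are made up.

def gan_losses(d_real, d_fake, eps=1e-8):
    # Discriminator wants d_real -> 1 and d_fake -> 0.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Generator (non-saturating form) wants d_fake -> 1.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

d_loss, g_loss = gan_losses(np.array([0.9, 0.8]), np.array([0.2, 0.1]))
print(round(float(d_loss), 3), round(float(g_loss), 3))
```

Training alternates between lowering d_loss (sharpening the "appraiser") and lowering g_loss (improving the "painter"), which is the competition described above.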
The generator is like a painter, trying to produce realistic images from text descriptions, while the discriminator is like an appraiser, trying to tell which images are real and which were painted by the generator. As the two keep competing, the generator gets better and better at painting realistic images, the discriminator gets smarter and smarter at telling real from fake, and the end result is ever more realistic generated images.<\/section>\n<section><strong>A bit like a student and a strict grader pushing each other to improve.<\/strong><\/section>\n<section>GANs also have some shortcomings:<\/section>\n<section><strong>Distortion:<\/strong> Compared to images generated by diffusion models, GAN outputs tend to have more artifacts and distortions.<\/section>\n<section><strong>Training stability:<\/strong> GAN training is an adversarial process between a generator and a discriminator, which can make training unstable and hard to tune. By contrast, diffusion model training is more stable because it does not rely on adversarial training.<\/section>\n<section><strong>Diversity:<\/strong> Compared to GANs, diffusion models exhibit higher diversity in generated images, producing richer and more varied outputs without depending too heavily on specific patterns in the training dataset.<\/section>\n<section>Around 2020, diffusion models started to gain attention in academia and industry, especially as they performed well across many aspects of image generation.<\/section>\n<section>But this does not mean GANs are obsolete; they are still widely explored and applied in style transfer and super-resolution.<\/section>\n<section><strong>2. Diffusion Model<\/strong><\/section>\n<section>Diffusion Models are inspired by non-equilibrium thermodynamics. 
The theory first defines a Markov chain of diffusion steps to slowly add random noise to the data, and then learns the inverse diffusion process to construct the desired data samples from the noise.<\/section>\n<section>To explain in layman&#039;s terms, the way a diffusion model works is a bit like a sculptor starting with a rough block of stone (or in our case, a blurry, disordered image) and gradually refining and tweaking it until a fine sculpture (i.e., a clear, meaningful image) is formed.<\/section>\n<section>The Runway and Pika that we are familiar with are actually based on the Diffusion model. However, the details are different. There are two technical architectures for these two products:<\/section>\n<section><strong>Pika \u2013 Per Frame<\/strong><\/section>\n<section>In the \u201cPer Frame\u201d architecture, the diffusion model processes each frame in the video separately, as if they were independent pictures.<\/section>\n<section>The advantage of this method is that it can guarantee the image quality of each frame. However, it cannot effectively capture the temporal coherence and dynamic changes in the video because each frame is processed independently.<\/section>\n<section>Therefore, a certain degree of accuracy will be lost. 
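The Markov-chain noising process described above can be made concrete with a toy 1-D example. The linear beta schedule and the tiny "dataset" below are assumptions for illustration only, and the learned reverse (denoising) network is not shown.

```python
import numpy as np

# Toy sketch of the forward diffusion process on a 1-D "image".
rng = np.random.default_rng(0)
x0 = np.linspace(-1.0, 1.0, 8)        # stand-in for clean data
T = 10
betas = np.linspace(1e-4, 0.2, T)     # made-up linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x_noisy = q_sample(x0, T - 1)         # heavily noised version of x0
print(x_noisy.shape)                  # → (8,)
```

As t grows, alpha_bar shrinks and x_t approaches pure noise; a diffusion model learns to run exactly this chain in reverse, which is the "sculpting" step.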
We see that the early videos generated by Pika are a bit &quot;blurry&quot;, which may be related to this.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2993\" title=\"640-82\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-82.png\" alt=\"640-82\" width=\"710\" height=\"242\" \/><\/section>\n<section><strong>Runway \u2013 Per Clip<\/strong><\/section>\n<section>The &quot;Per Clip&quot; architecture treats the entire video clip as a single entity.<\/section>\n<section>In this approach, the diffusion model takes into account the temporal relationship and coherence between frames in the video.<\/section>\n<section>Its advantage is that it can better capture and generate the temporal dynamics of videos, including the coherence of motion and behavior, and more completely preserve the accuracy of training video data.<\/section>\n<section>However, the \u201cPer Clip\u201d approach may require a more complex model and more computational resources since it needs to handle the temporal dependencies in the entire video clip.<\/section>\n<section>Compared with Pika&#039;s Per Frame architecture, Per Clip retains the information of the training video material more completely, but its cost is higher and its ceiling is also relatively high.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2994\" title=\"640-83\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-83.png\" alt=\"640-83\" width=\"725\" height=\"189\" \/><\/section>\n<section>Since the diffusion model itself is computationally intensive, this computational burden will increase dramatically when generating long videos, and temporal consistency is also a considerable test for the diffusion model.<\/section>\n<section>The Transformer architecture is particularly good at processing long sequence data, which is an important advantage for generating long videos. 
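Stepping back to the Per Frame vs Per Clip distinction above: at the tensor level it is simply a question of how much temporal context the denoiser sees at once. A shape-only sketch follows; the two functions are placeholders standing in for real denoising networks, not actual model code.

```python
import numpy as np

video = np.zeros((16, 3, 64, 64))    # (T frames, channels, H, W)

def denoise_frame(frame):
    # Per Frame: sees one (C, H, W) image, no knowledge of its neighbors.
    return frame

def denoise_clip(clip):
    # Per Clip: sees the whole (T, C, H, W) block, so it can model
    # temporal dependencies across frames, at higher compute cost.
    return clip

per_frame = np.stack([denoise_frame(f) for f in video])
per_clip = denoise_clip(video)
print(per_frame.shape, per_clip.shape)  # same shapes, different context
```

The outputs have identical shapes; the difference is only in what information each call could, in principle, use, which is exactly the coherence-versus-cost trade-off described above.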
They can better understand and maintain the temporal coherence of video content.<\/section>\n<section><strong>3. Transformer architecture (LLM architecture)<\/strong><\/section>\n<section>In language models, the Transformer learns the rules and structure of language by analyzing large amounts of text, then predicts subsequent text probabilistically.<\/section>\n<section>Applied to image generation, whereas a diffusion model creates order and meaning out of chaos, the Transformer instead learns and imitates the &quot;language&quot; of the visual world. For example, it learns how colors, shapes, and objects combine and interact visually, and then uses that information to generate new images.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2995\" title=\"640-17\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-17.jpeg\" alt=\"640-17\" width=\"1080\" height=\"580\" \/><\/section>\n<section>Transformer architectures have unique advantages, including explicit density modeling and a more stable training process. They can exploit correlations between frames to generate coherent, natural video content.<\/section>\n<section>In addition, the largest diffusion models today have only 7 to 8 billion parameters, while the largest transformer models may have reached the trillion level, a gap of roughly two orders of magnitude.<\/section>\n<section>However, the Transformer architecture faces challenges in terms of computing resources, training data volume, and time. 
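The "language of the visual world" idea can be caricatured with a count-based next-token model. Real systems discretize images or video with a neural tokenizer and use a transformer; the toy token stream below, with made-up labels, only shows the prediction principle.

```python
from collections import Counter, defaultdict

# Treat a picture as a sequence of discrete tokens and learn, by counting,
# which token tends to follow which (a bigram stand-in for a transformer).
tokens = ["sky", "sky", "sky", "sun", "sky", "sky", "grass"]

counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Most probable next token under the bigram counts."""
    return counts[token].most_common(1)[0][0]

print(predict_next("sky"))  # → sky
```

Generating an image then amounts to repeatedly sampling "what visual token comes next", with a transformer replacing the counting table.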
Compared with the diffusion model, it needs more model parameters and places relatively higher demands on computing resources and datasets.<\/section>\n<section>Therefore, in the early days when computing power and data were scarce, the Transformer architecture for generating videos and images was not fully explored and applied.<\/section>\n<h2>AI video extension technology and applications<\/h2>\n<h3>&quot;Photo Dance&quot; - Animate Anyone<\/h3>\n<h3>Based on diffusion models + ControlNet-related technologies<\/h3>\n<section>Technical overview: the network starts from multiple frames of noise as initial input and adopts a denoising UNet based on the Stable Diffusion (SD) design. It is similar to the familiar AnimateDiff, combined with pose control and consistency-optimization techniques similar to ControlNet.<\/section>\n<section>The network core consists of three key parts:<\/section>\n<section>1. ReferenceNet, responsible for encoding the appearance features of the character in the reference image, ensuring visual consistency;<\/section>\n<section>2. Pose Guider, which encodes motion-control signals to achieve precise control of the character&#039;s movements;<\/section>\n<section>3. Temporal Layer, which processes time-series information to ensure smooth, natural character movement between consecutive frames. 
Together, these three components let the network generate animated characters that are visually consistent, controllable in motion, and temporally coherent.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2996\" title=\"640-18\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-18.jpeg\" alt=\"640-18\" width=\"1080\" height=\"453\" \/><\/section>\n<h3>\u201cConverting live video into animation\u201d - DomoAI<\/h3>\n<section>The base model is likewise a diffusion model, combined with style transfer.<\/section>\n<section>The first step, ControlNet Passes Export, extracts the control channels that serve as the basis for the initial raw animation frames.<\/section>\n<section>The second step, Animation Raw - LCM, is the heart of the workflow and renders the main raw animation.<\/section>\n<section>The third step, AnimateDiff Refiner - LCM, further enhances the raw animation: adding detail, upscaling, and refining.<\/section>\n<section>Finally, AnimateDiff FaceFix - LCM is dedicated to fixing faces that still look off after the refinement pass.<\/section>\n<h3>\u201cAI video face swap\u201d - Faceswap<\/h3>\n<section>In general, face swapping involves four main steps: face detection, feature extraction, face conversion, and post-processing.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2997\" title=\"640-84\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-84.png\" alt=\"640-84\" width=\"1007\" height=\"656\" \/><\/section>\n<section>AI video face-swapping technology, commonly referred to as &quot;deepfake,&quot; is based on deep learning, especially models like GANs (generative adversarial networks) or autoencoders. 
Because the technology is risky to use, it will not be introduced in detail here.<\/section>\n<h2>AI Video Technology Outlook<\/h2>\n<h3>&quot;The future of unification?&quot; - Transformer architecture<\/h3>\n<section>Not only can you see it, you can also hear it<\/section>\n<section>Google recently released VideoPoet, a one-stop video generation tool that produces both video and audio, supports longer videos, and goes a long way toward solving the motion-consistency problems common in existing video generation, especially the continuity of large-scale motion.<\/section>\n<section><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2998\" title=\"640-85\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/640-85.png\" alt=\"640-85\" width=\"1080\" height=\"528\" \/><\/section>\n<section>Unlike most models in the video field, VideoPoet did not take the diffusion route but was developed on the Transformer architecture, integrating multiple video generation capabilities into a single LLM, proving that beyond its outstanding text generation abilities, the transformer also has great potential in video generation. 
In addition, it can generate sound at the same time and supports language-driven editing of videos.<\/section>\n<section>&quot;The largest diffusion model has only 7 to 8 billion parameters, but the largest transformer model may have reached the trillion level. In language models, large companies have spent five years and invested tens of billions of dollars to bring models to their current scale. Moreover, as model scale increases, the cost of the large-model architecture also rises exponentially,&quot; said Lu Jiang, a scientist at Google.<\/section>\n<section>A video model built on the large language model Transformer structure remains, in essence, a \u201clanguage model\u201d, because the training procedure and model framework are unchanged; only the input tokens extend to other modalities, such as vision, which can likewise be discretized into tokens.<\/section>\n<section>In the early days, we did not see outstanding results from Transformers in video generation because of limits on resources, computing power, video data, and so on. 
However, with the rapid development of large language models driven by GPT in recent years, and the funding that has followed, <strong>\u201cone-stop\u201d multimodal large models covering text, image, sound and video will attract much attention in the future.<\/strong><\/section>\n<h3><strong>Is AI video also about to usher in its GPT moment?<\/strong><\/h3>\n<section>It is worth noting that although the Transformer is the most popular architecture, with a highly scalable and parallel neural network design, the memory requirement of its full attention mechanism grows quadratically with the length of the input sequence. When processing high-dimensional signals such as video, this scaling leads to excessive cost.<\/section>\n<section>Therefore, researchers proposed the Window Attention Latent Transformer (WALT): a Transformer-based latent video diffusion model (LVDM). It can also be said that:<\/section>\n<h3><strong>Transformer and Diffusion Model coexist<\/strong><\/h3>\n<section>WALT is a collaboration with Professor Fei-Fei Li and her students. It is based on diffusion but also uses transformers, combining the strengths of the diffusion model with the power of the Transformer.<\/section>\n<section>In this structure, the diffusion model handles the generation and quality details of the video images, while the Transformer uses its self-attention mechanism to optimize correlation and consistency across the sequence.<\/section>\n<section>This combination makes the video not only more realistic visually, but also smoother and more natural in motion transitions. 
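The window-attention idea can be motivated with a simple size count: full self-attention materializes a T x T score matrix, while non-overlapping windows of width w only need w x w blocks. This is a toy sketch with random vectors and made-up sizes, not WALT's actual layout.

```python
import numpy as np

def attn_scores(q, k):
    # Scaled dot-product attention scores (softmax omitted for brevity).
    return q @ k.T / np.sqrt(q.shape[-1])

T, d, w = 16, 8, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))

full = attn_scores(q, k)                         # T x T matrix: O(T^2) memory
windowed = [attn_scores(q[i:i + w], k[i:i + w])  # w x w per window: O(T * w)
            for i in range(0, T, w)]

print(full.size, sum(b.size for b in windowed))  # → 256 64
```

For video-length token sequences the quadratic term dominates quickly, which is why windowed and other sparse attention patterns matter for video models.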
Therefore, in the next 1-2 years, the Transformer and the Diffusion Model will likely coexist.<\/section>\n<h3><strong>Challenges facing AI video technology<\/strong><\/h3>\n<section>In the field of AI video technology, @\u95f2\u4eba\u4e00\u5764, a well-known AI video creator, raised several key challenges.<\/section>\n<section>First, the clarity of generated video needs further improvement to reach higher visual quality. Second, keeping characters consistent across a video is a hard problem, involving accurately capturing and reproducing their appearance and movements. Finally, <strong>controllability needs to improve, especially adjustment in three-dimensional space: current technology is mostly limited to two-dimensional fine-tuning and cannot effectively adjust along the Z axis.<\/strong> These challenges point to the key areas that need attention and improvement as AI video technology develops.<\/section>","protected":false},"excerpt":{"rendered":"<p>Last year was the year that AI video exploded. In January 2023, there was no publicly available text-to-video model. Up to now, there are dozens of AI video generation products and millions of users. Reviewing this year's AI-generated video development and noteworthy technologies and applications, let's talk about the relevant content. It mainly contains the following aspects: current AI video classification Generative AI video technology AI video extension technology and applications Generative AI video outlook Generative AI video challenges AI video classification AI video can basically be divided into the following four broad categories: 1. Text\/image generated video As the name suggests, it is to input a text description\/upload a picture to generate the corresponding video. 
Our common Runway, Pika, NeverE<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[144],"tags":[956,955],"collection":[],"class_list":["post-2982","post","type-post","status-publish","format-standard","hentry","category-baike","tag-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2982","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=2982"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2982\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=2982"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=2982"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=2982"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=2982"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}