{"id":5820,"date":"2024-03-20T09:31:18","date_gmt":"2024-03-20T01:31:18","guid":{"rendered":"https:\/\/www.1ai.net\/?p=5820"},"modified":"2024-03-20T09:31:18","modified_gmt":"2024-03-20T01:31:18","slug":"%e8%b0%b7%e6%ad%8c%e6%8e%a8%e5%87%ba%e5%a4%9a%e6%a8%a1%e6%80%81-vlogger-ai%ef%bc%9a%e8%ae%a9%e9%9d%99%e6%80%81%e8%82%96%e5%83%8f%e5%9b%be%e5%8a%a8%e8%b5%b7%e6%9d%a5%e8%af%b4%e8%af%9d","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/5820.html","title":{"rendered":"Google launches multimodal VLOGGER AI: making static portraits move and &quot;talk&quot;"},"content":{"rendered":"<p data-vmark=\"6013\"><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%b0%b7%e6%ad%8c\" title=\"[View articles tagged with [Google]]\" target=\"_blank\" >Google<\/a> recently published a blog post on GitHub introducing the <a href=\"https:\/\/www.1ai.net\/en\/tag\/vlogger\" title=\"[View articles tagged with [VLOGGER]]\" target=\"_blank\" >VLOGGER<\/a> <a href=\"https:\/\/www.1ai.net\/en\/tag\/ai%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [AI models]]\" target=\"_blank\" >AI model<\/a>. Users only need to provide a portrait photo and an audio clip, and <strong>the model animates the person in the photo so that they &quot;speak&quot; the audio with rich facial expressions.<\/strong><\/p>\n<p data-vmark=\"4578\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5821\" title=\"6bac04a7-8b16-4c44-92ed-00a92f97be15\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/03\/6bac04a7-8b16-4c44-92ed-00a92f97be15.jpg\" alt=\"6bac04a7-8b16-4c44-92ed-00a92f97be15\" width=\"1070\" height=\"428\" \/><\/p>\n<p data-vmark=\"e6e0\">VLOGGER AI is a <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81\" title=\"[View articles tagged with [multimodal]]\" target=\"_blank\" >multimodal<\/a> diffusion model for virtual portraits. It was trained on the MENTOR database, which contains portraits of more than 800,000 people and more than 2,200 hours of video, 
allowing VLOGGER to generate portrait videos of people of different races and ages, in varied clothing and poses.<\/p>\n<p data-vmark=\"bf67\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5822\" title=\"2df76088-33d8-4775-8f64-5f726100e8f7\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/03\/2df76088-33d8-4775-8f64-5f726100e8f7.jpg\" alt=\"2df76088-33d8-4775-8f64-5f726100e8f7\" width=\"1070\" height=\"452\" \/><\/p>\n<p data-vmark=\"076e\">The researchers said: &quot;Compared to previous multimodal methods, VLOGGER has the advantages of not requiring training for each individual, not relying on face detection and cropping, generating complete images (not just faces or lips), and considering a wide range of scenarios (such as visible torsos or different subject identities), which are critical for correctly synthesizing communicating humans.&quot;<\/p>\n<p data-vmark=\"c757\">Google sees VLOGGER as a step toward a &quot;universal chatbot&quot; that would let AI interact with people naturally through voice, gestures, and eye contact.<\/p>\n<p data-vmark=\"9dba\">VLOGGER&#039;s potential applications also include reports, education, and narration. 
It can also be used to edit existing videos: if you are not satisfied with the expressions in a video, you can adjust them.<\/p>\n<p data-vmark=\"1fe1\">Paper reference:<\/p>\n<ul class=\"custom_reference list-paddingleft-1\">\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"b4dc\"><a href=\"https:\/\/enriccorona.github.io\/vlogger\/\" target=\"_blank\" rel=\"noopener\">VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis<\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Google recently published a blog post on GitHub introducing the VLOGGER AI model. Users provide a portrait photo and an audio clip, and the model animates the person in the photo so that they \"speak\" the audio with facial expressions. VLOGGER AI is a multimodal diffusion model for virtual portraits, trained on the MENTOR database, which contains portraits of more than 800,000 people and more than 2,200 hours of video, allowing VLOGGER to generate portrait videos of people of different races and ages, in varied clothing and poses. 
Compared to previous multimodal approaches, VLOGGER has the advantage of not needing to perform a multimodal analysis on every<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[167,1762,592,281],"collection":[],"class_list":["post-5820","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-vlogger","tag-592","tag-281"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/5820","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=5820"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/5820\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=5820"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=5820"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=5820"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=5820"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}