{"id":15469,"date":"2024-07-13T12:09:53","date_gmt":"2024-07-13T04:09:53","guid":{"rendered":"https:\/\/www.1ai.net\/?p=15469"},"modified":"2024-07-13T12:09:53","modified_gmt":"2024-07-13T04:09:53","slug":"%e6%99%ba%e8%b0%b1-ai-%e5%bc%80%e6%ba%90%e8%a7%86%e9%a2%91%e7%90%86%e8%a7%a3%e6%a8%a1%e5%9e%8b-cogvlm2-video%ef%bc%8c%e5%8f%af%e5%9b%9e%e7%ad%94%e6%97%b6%e9%97%b4%e7%9b%b8%e5%85%b3%e9%97%ae%e9%a2%98","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/15469.html","title":{"rendered":"Zhipu AI open-sources video understanding model CogVLM2-Video, which can answer time-related questions"},"content":{"rendered":"<p data-vmark=\"1c5f\"><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%99%ba%e8%b0%b1ai\" title=\"[View articles tagged with [Zhipu AI]]\" target=\"_blank\" >Zhipu AI<\/a> announced that it has trained a new video understanding model, <a href=\"https:\/\/www.1ai.net\/en\/tag\/cogvlm2-video\" title=\"[View articles tagged with [CogVLM2-Video]]\" target=\"_blank\" >CogVLM2-Video<\/a>, and released it as <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open source<\/a>.<\/p>\n<p data-vmark=\"8029\">It is reported that most current video understanding models rely on frame averaging and video token compression, which discard temporal information and prevent the models from accurately answering time-related questions. 
Some models that focus on temporal question-answering datasets are restricted to specific formats and domains, which costs them broader question-answering capability.<\/p>\n<p data-vmark=\"20ef\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15470\" title=\"3b1da7b1-3a10-408d-929e-2a13414106da\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/3b1da7b1-3a10-408d-929e-2a13414106da.png\" alt=\"3b1da7b1-3a10-408d-929e-2a13414106da\" width=\"1290\" height=\"548\" \/><\/p>\n<p>\u25b2 Official effect demonstration<\/p>\n<p data-vmark=\"1d16\">Zhipu AI proposed an <strong>automated temporal grounding data construction method based on visual models<\/strong>, generating 30,000 time-related video question-answering samples. Based on this new dataset and existing open-domain question-answering data, the team introduced multi-frame video images and timestamps as encoder inputs and trained the CogVLM2-Video model.<\/p>\n<p data-vmark=\"5b8c\">Zhipu AI said that CogVLM2-Video not only achieves state-of-the-art performance on public video understanding benchmarks, but also excels at video caption generation and temporal grounding.<\/p>\n<p data-vmark=\"4e95\">Related links:<\/p>\n<ul class=\"medium-size list-paddingleft-2\">\n<li>\n<p data-vmark=\"7205\">Code:<a href=\"https:\/\/github.com\/THUDM\/CogVLM2\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/github.com\/THUDM\/CogVLM2<\/span><\/a><\/p>\n<\/li>\n<li>\n<p data-vmark=\"65be\">Project website:<a href=\"https:\/\/cogvlm2-video.github.io\/\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/cogvlm2-video.github.io<\/span><\/a><\/p>\n<\/li>\n<li>\n<p data-vmark=\"4ee6\">Online trial:<a href=\"http:\/\/36.103.203.44:7868\/\" target=\"_blank\" rel=\"noopener\"><span 
class=\"link-text-start-with-http\">http:\/\/36.103.203.44:7868\/<\/span><\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Zhipu AI has announced the training of a new video understanding model, CogVLM2-Video, and has made it open source. According to the report, most current video understanding models use frame averaging and video token compression, resulting in the loss of temporal information and the inability to accurately answer time-related questions. Some models that focus on temporal Q&amp;A datasets are restricted to specific formats and domains, which costs them broader Q&amp;A capability. \u25b2 Official effect demonstration. Zhipu AI proposes an automated temporal grounding data construction method based on visual models, generating 30,000 time-related video Q&amp;A samples. Based on this new dataset and existing open-domain Q&amp;A data, multi-frame video images and timestamps are introduced as encoder inputs, and CogVLM2-Video is 
trained.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[3470,219,379],"collection":[],"class_list":["post-15469","post","type-post","status-publish","format-standard","hentry","category-news","tag-cogvlm2-video","tag-219","tag-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15469","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=15469"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15469\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=15469"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=15469"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=15469"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=15469"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}