{"id":10770,"date":"2024-05-21T09:33:51","date_gmt":"2024-05-21T01:33:51","guid":{"rendered":"https:\/\/www.1ai.net\/?p=10770"},"modified":"2024-05-21T09:33:51","modified_gmt":"2024-05-21T01:33:51","slug":"%e6%99%ba%e8%b0%b1%e5%bc%80%e6%ba%90%e6%96%b0%e4%b8%80%e4%bb%a3%e5%a4%9a%e6%a8%a1%e6%80%81%e5%a4%a7%e6%a8%a1%e5%9e%8bcogvlm2","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/10770.html","title":{"rendered":"Zhipu open-sources the next-generation multimodal large model CogVLM2"},"content":{"rendered":"<p><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%99%ba%e8%b0%b1\" title=\"[View articles tagged with [Smart Spectrum]]\" target=\"_blank\" >Zhipu<\/a> AI recently announced a new-generation <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81%e5%a4%a7%e6%a8%a1%e5%9e%8b\" title=\"[Sees articles with [Multimodal Large Model] labels]\" target=\"_blank\" >multimodal large model<\/a>, <a href=\"https:\/\/www.1ai.net\/en\/tag\/cogvlm2\" title=\"_Other Organiser\" target=\"_blank\" >CogVLM2<\/a>. The model improves significantly on key performance indicators compared with the previous-generation CogVLM, and it supports 8K text length and images with a resolution of up to 1344&times;1344. CogVLM2 improves on its predecessor by 32% on the OCRbench benchmark and by 21.9% on the TextVQA benchmark, showing strong document image understanding capabilities. 
Although CogVLM2 has only 19B parameters, its performance approaches or exceeds that of GPT-4V.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-10771\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/05\/6385187728541264883470976.jpg\" alt=\"\" width=\"653\" height=\"711\" \/><\/p>\n<p>The architecture of CogVLM2 builds on the previous-generation model and includes a 5-billion-parameter visual encoder and a 7-billion-parameter visual expert module, which models the interaction between visual and language sequences in fine detail through its own parameter settings. This deep fusion strategy integrates the visual and language modalities more closely while preserving the model&#039;s strengths in language processing. In addition, only about 12 billion parameters are actually activated during inference, thanks to the carefully designed multi-expert module structure, which significantly improves inference efficiency.<\/p>\n<p>In terms of performance, CogVLM2 achieves excellent results on multiple multimodal benchmarks, including TextVQA, DocVQA, ChartQA, OCRbench, MMMU, MMVet, and MMBench. These tests cover a wide range of capabilities, from text and image understanding to complex reasoning and interdisciplinary tasks. 
The two CogVLM2 models take <span class=\"spamTxt\">first<\/span> place on several of these benchmarks, and on the others they perform close to closed-source models.<\/p>\n<p><strong>Code repository:<\/strong><\/p>\n<p>GitHub: https:\/\/github.com\/THUDM\/CogVLM2<\/p>\n<p><strong>Model download:<\/strong><\/p>\n<p>Hugging Face: huggingface.co\/THUDM<\/p>\n<p>ModelScope community: modelscope.cn\/models\/ZhipuAI<\/p>\n<p>Wisemodel community: wisemodel.cn\/models\/ZhipuAI<\/p>\n<p><strong>Online demo:<\/strong><\/p>\n<p>https:\/\/modelscope.cn\/studios\/ZhipuAI\/Cogvlm2-llama3-chinese-chat-Demo\/summary<\/p>\n<p><strong>CogVLM2 technical documentation:<\/strong><\/p>\n<p>https:\/\/zhipu-ai.feishu.cn\/wiki\/OQJ9wk5dYiqk93kp3SKcBGDPnGf<\/p>","protected":false},"excerpt":{"rendered":"<p>Zhipu AI recently announced the launch of CogVLM2, a next-generation multimodal large model that offers significant improvements in key performance metrics over its predecessor, CogVLM, while supporting 8K text lengths and images with resolutions up to 1344&times;1344. CogVLM2 delivers a performance improvement of 32% on the OCRbench benchmark and of 21.9% on the TextVQA benchmark, demonstrating strong document image understanding capabilities. Although CogVLM2 has only 19B parameters, its performance is close to or exceeds that of GPT-4V. 
The technical architecture of CogVLM2 has been optimized from the previous generation model and includes a 5 billion parameter vision encoder and a 7 billion parameter vision expert module, which are optimized by a unique<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[2681,602,219,2680],"collection":[],"class_list":["post-10770","post","type-post","status-publish","format-standard","hentry","category-news","tag-cogvlm2","tag-602","tag-219","tag-2680"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/10770","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=10770"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/10770\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=10770"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=10770"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=10770"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=10770"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}