{"id":41887,"date":"2025-08-28T11:14:17","date_gmt":"2025-08-28T03:14:17","guid":{"rendered":"https:\/\/www.1ai.net\/?p=41887"},"modified":"2025-08-28T18:46:49","modified_gmt":"2025-08-28T10:46:49","slug":"%e8%a1%8c%e4%b8%9a%e9%a6%96%e4%b8%aa%ef%bc%9a8b-%e5%8f%82%e6%95%b0%e9%9d%a2%e5%a3%81%e5%b0%8f%e9%92%a2%e7%82%ae-minicpm-v-4-5-%e5%bc%80%e6%ba%90%ef%bc%8c%e5%8f%b7%e7%a7%b0%e6%9c%80%e5%bc%ba","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/41887.html","title":{"rendered":"Industry First: ModelBest Open-Sources 8B-Parameter \"Little Steel Cannon\" MiniCPM-V 4.5, Claimed the Strongest On-Device Multimodal Model"},"content":{"rendered":"<p>August 27 news: <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%9d%a2%e5%a3%81%e6%99%ba%e8%83%bd\" title=\"View articles tagged ModelBest\" target=\"_blank\" >ModelBest<\/a> announced on August 26 that it has <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"View articles tagged open source\" target=\"_blank\" >open-sourced<\/a> its 8B-parameter multimodal flagship MiniCPM-V 4.5, the industry's first <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81%e6%a8%a1%e5%9e%8b\" title=\"View articles tagged multimodal model\" target=\"_blank\" >multimodal model<\/a> with \"high-refresh-rate\" video understanding capability.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-41888\" title=\"96f13c91j00t1ooys00bmd000u0013zp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/08\/96f13c91j00t1ooys00bmd000u0013zp.jpg\" alt=\"96f13c91j00t1ooys00bmd000u0013zp\" width=\"1080\" height=\"1439\" \/><\/p>\n<p>MiniCPM-V 4.5 is claimed to be the \"strongest on-device multimodal model\", reaching SOTA for its size in high-refresh-rate video understanding, long-video understanding, OCR, and document parsing, with performance exceeding Qwen2.5-VL 72B.<\/p>\n<p>ModelBest said that mainstream multimodal models usually adopt 1 fps frame 
extraction when handling video understanding tasks, capturing only one frame per second for recognition and understanding, in order to balance compute, power consumption, and other constraints. Although this preserves inference efficiency to a certain extent, most of the visual information is lost, which weakens a multimodal model's fine-grained understanding of the dynamic world.<\/p>\n<p>MiniCPM-V 4.5 is the industry's first multimodal model with high-refresh-rate video understanding capability. By extending the model structure from a 2D-Resampler to a 3D-Resampler, it performs high-density compression on 3D video clips and can ingest up to 6 times as many video frames at the same visual-token overhead, achieving a 96x visual compression rate, 12 to 24 times that of comparable models.<\/p>\n<p>MiniCPM-V 4.5 significantly raises the frame-sampling frequency, going from watching \"PowerPoint slides\" to understanding \"moving pictures\". 
When facing fast-changing footage, MiniCPM-V 4.5 can see more accurately and in greater detail than representative cloud models such as Gemini-2.5-Pro, GPT-5, and GPT-4o.<\/p>\n<p>On MotionBench and FavorBench, benchmarks that measure high-refresh-rate video understanding, MiniCPM-V 4.5 reaches SOTA among models of its size and surpasses Qwen2.5-VL 72B, a class-leading result.<\/p>\n<p>With only 8B parameters, MiniCPM-V 4.5 once again raises the capability ceiling in multimodal tasks such as image understanding, video understanding, and complex document recognition.<\/p>\n<p>In image understanding, MiniCPM-V 4.5 leads many closed-source models such as GPT-4o, GPT-4.1, and Gemini-2.0-Pro on the OpenCompass evaluation, and even outperforms Qwen2.5-VL 72B.<\/p>\n<p>In video understanding, MiniCPM-V 4.5 achieves best-in-class results on LVBench, MLVU, Video-MME, LongVideoBench, and other benchmarks.<\/p>\n<p>In complex document recognition, MiniCPM-V 4.5 achieves SOTA performance among general multimodal models of its size on the OmniDocBench OverallEdit, TextEdit, and TableEdit metrics.<\/p>\n<p>In addition, MiniCPM-V 4.5 supports both a regular mode and a deep-thinking mode to balance performance and responsiveness: the regular mode delivers excellent multimodal understanding in most scenarios, while the deep-thinking mode targets complex, compound reasoning tasks.<\/p>\n<p>On the Video-MME video understanding benchmark and the OpenCompass single-image evaluation, MiniCPM-V 4.5 reaches SOTA for its size while also leading in GPU memory usage and average inference time.<\/p>\n<p>On Video-MME, a video understanding test set covering short, medium, and long videos, MiniCPM-V 4.5 uses a 3-frame packing strategy for inference, with a time overhead (excluding frame-extraction time) of only one tenth that of comparable 
models.<\/p>\n<p>1AI attached the open-source links:<\/p>\n<ul>\n<li>GitHub: https:\/\/github.com\/OpenBMB\/MiniCPM-o<\/li>\n<li>Hugging Face: https:\/\/huggingface.co\/openbmb\/MiniCPM-V-4_5<\/li>\n<li>ModelScope: https:\/\/www.modelscope.cn\/models\/OpenBMB\/MiniCPM-V-4_5<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>August 27 news: ModelBest announced on August 26 that it has open-sourced MiniCPM-V 4.5, an 8B-parameter multimodal flagship model and the industry's first multimodal model with \"high-refresh-rate\" video understanding capability. MiniCPM-V 4.5 is said to reach SOTA for its size in high-refresh-rate video understanding, long-video understanding, OCR, and document parsing, with performance exceeding Qwen2.5-VL 72B, hence the claim of \"strongest on-device multimodal model\". ModelBest said that, to balance compute, power consumption, and other factors in video understanding tasks, mainstream multimodal models usually adopt 1 fps frame extraction, capturing only one frame per second for recognition and understanding. 
Although this ensures the model's inference efficiency to a certain extent<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[1096,219,2184],"collection":[],"class_list":["post-41887","post","type-post","status-publish","format-standard","hentry","category-news","tag-1096","tag-219","tag-2184"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41887","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=41887"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41887\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=41887"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=41887"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=41887"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=41887"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}