{"id":18936,"date":"2024-08-30T09:45:09","date_gmt":"2024-08-30T01:45:09","guid":{"rendered":"https:\/\/www.1ai.net\/?p=18936"},"modified":"2024-08-30T09:45:30","modified_gmt":"2024-08-30T01:45:30","slug":"%e9%98%bf%e9%87%8c%e9%80%9a%e4%b9%89%e5%8d%83%e9%97%ae%e6%8e%a8%e5%87%ba-qwen2-vl%ef%bc%9a%e5%bc%80%e6%ba%90-2b-7b-%e6%a8%a1%e5%9e%8b%ef%bc%8c%e8%83%bd%e7%90%86%e8%a7%a3%e8%b6%85-20-%e5%88%86","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/18936.html","title":{"rendered":"Alibaba Tongyi Qianwen launches Qwen2-VL: open-source 2B\/7B models that can understand over 20 minutes of video"},"content":{"rendered":"<p><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%98%bf%e9%87%8c%e5%b7%b4%e5%b7%b4\" title=\"[View articles tagged with [Alibaba]]\" target=\"_blank\" >Alibaba<\/a>&#039;s cloud computing division has just released a new family of <a href=\"https:\/\/www.1ai.net\/en\/tag\/ai%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [AI models]]\" target=\"_blank\" >AI models<\/a>: Qwen2-VL. The model&#039;s strength lies in its ability to understand visual content, including images and videos; it can even analyze videos more than 20 minutes long in real time.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-18937\" title=\"3148b178j00sj0cro001ud000jj009om\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/08\/3148b178j00sj0cro001ud000jj009om.jpg\" alt=\"3148b178j00sj0cro001ud000jj009om\" width=\"703\" height=\"348\" \/><\/p>\n<p>It performs very well in third-party benchmarks against other leading state-of-the-art models such as Meta\u2019s Llama 3.1, OpenAI\u2019s GPT-4o, Anthropic\u2019s Claude 3 Haiku, and Google\u2019s Gemini 1.5 Flash.<\/p>\n<p>Alibaba evaluated the model&#039;s visual capabilities across six key dimensions: complex university-level problem solving, mathematical ability, document and table understanding, multilingual text-image understanding, general scenario question answering, video 
understanding, and agent-based interaction. Its 72B model demonstrated top performance on most indicators, even surpassing closed-source models such as GPT-4o and Claude 3.5 Sonnet.<\/p>\n<p>The details are shown in the following figure:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-18938\" title=\"7d6c7dcdj00sj0cro0048d000oy00ekm\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/08\/7d6c7dcdj00sj0cro0048d000oy00ekm.jpg\" alt=\"7d6c7dcdj00sj0cro0048d000oy00ekm\" width=\"898\" height=\"524\" \/><\/p>\n<p><strong>Superior image and video analysis capabilities<\/strong><\/p>\n<p>Qwen2-VL can not only analyze static images<strong> but also summarize video content, answer questions about it, and even provide online chat support in real time.<\/strong><\/p>\n<p>As the Qwen research team wrote in a blog post about the new Qwen2-VL series of models on GitHub: \u201cIn addition to static images, Qwen2-VL extends its capabilities to video content analysis. It can summarize video content, answer questions related to it, and maintain a continuous flow of conversation in real time, providing live chat support. This capability enables it to act as a personal assistant, helping users by providing insights and information extracted directly from video content.\u201d<\/p>\n<p>Alibaba says the model can analyze videos longer than 20 minutes and answer questions about their content, making Qwen2-VL a powerful assistant for online learning, technical support, or any other scenario that requires understanding video content.<\/p>\n<p>Qwen2-VL&#039;s language capabilities are equally strong: it supports English, Chinese, most European languages, Japanese, Korean, Arabic, Vietnamese, and more, so users around the world can use it easily. 
To help users better understand its capabilities, Alibaba has also shared application examples on its GitHub.<\/p>\n<p><strong>Three versions<\/strong><\/p>\n<p>The new model comes in three sizes: Qwen2-VL-72B (72 billion parameters), Qwen2-VL-7B, and Qwen2-VL-2B. The 7B and 2B versions are released under the permissive open-source Apache 2.0 license, allowing enterprises to use them freely for commercial purposes.<\/p>\n<p>The largest 72B version is not yet publicly available and can only be accessed through a dedicated license and API.<\/p>\n<p>Qwen2-VL also introduces new technical features, such as Naive Dynamic Resolution support, which handles images of varying resolutions to ensure consistent and accurate visual interpretation, and a Multimodal Rotary Position Embedding (M-ROPE) system, which synchronously captures and integrates positional information across text, images, and videos.<\/p>\n<p data-vmark=\"2541\">The model links are as follows:<\/p>\n<ul class=\"small-size list-paddingleft-2\">\n<li>\n<p data-vmark=\"b883\">Qwen2-VL-2B-Instruct:<a href=\"https:\/\/www.modelscope.cn\/models\/qwen\/Qwen2-VL-2B-Instruct\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/www.modelscope.cn\/models\/qwen\/Qwen2-VL-2B-Instruct<\/span><\/a><\/p>\n<\/li>\n<li>\n<p data-vmark=\"5d16\">Qwen2-VL-7B-Instruct:<a href=\"https:\/\/www.modelscope.cn\/models\/qwen\/Qwen2-VL-7B-Instruct\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/www.modelscope.cn\/models\/qwen\/Qwen2-VL-7B-Instruct<\/span><\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Alibaba's cloud computing division has just released a brand new AI model, Qwen2-VL. 
The model can understand visual content, including images and videos, and can even analyze videos more than 20 minutes long in real time. It performs very well in third-party benchmarks against other leading state-of-the-art models such as Meta's Llama 3.1, OpenAI's GPT-4o, Anthropic's Claude 3 Haiku, and Google's Gemini 1.5 Flash. Alibaba evaluated the model's visual capabilities across six key dimensions: complex university-level problem solving, mathematical ability, document and table understanding, multilingual text images<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[167,390],"collection":[],"class_list":["post-18936","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-390"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/18936","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=18936"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/18936\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=18936"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=18936"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=18936"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=18936"}],"curies":[{"name":"wp","h
ref":"https:\/\/api.w.org\/{rel}","templated":true}]}}