{"id":2193,"date":"2023-12-23T09:43:05","date_gmt":"2023-12-23T01:43:05","guid":{"rendered":"https:\/\/www.1ai.net\/?p=2193"},"modified":"2023-12-23T09:43:05","modified_gmt":"2023-12-23T01:43:05","slug":"%e6%89%b3%e5%9b%9e%e4%b8%80%e5%b1%80%ef%bc%81gemini-pro%e5%a4%9a%e6%a8%a1%e6%80%81%e8%83%bd%e5%8a%9b%e5%92%8cgpt-4v%e4%b8%8d%e7%9b%b8%e4%b8%8a%e4%b8%8b","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/2193.html","title":{"rendered":"Back in the game! Gemini-Pro&#039;s multimodal capabilities are on par with GPT-4V"},"content":{"rendered":"<p>A recent <a href=\"https:\/\/www.1ai.net\/en\/tag\/gemini-pro\" title=\"[See articles with [Gemini-Pro] label]\" target=\"_blank\" >Gemini-Pro<\/a> evaluation report shows that Gemini-Pro has made significant progress in the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81\" title=\"[View articles tagged with [multimodal]]\" target=\"_blank\" >multimodal<\/a> field: it is comparable to GPT-4V and even performs better in some respects. First, on the proprietary multimodal benchmark MME, Gemini-Pro surpassed GPT-4V with an overall score of 1933.4, showing broad strength in both perception and cognition. Across the 37 visual understanding tasks, Gemini-Pro performed particularly well in text translation, color\/landmark\/person recognition, and OCR, demonstrating strong basic perception capabilities.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2194\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2023\/12\/6383885329082789481304987.jpg\" alt=\"\" width=\"490\" height=\"438\" \/><\/p>\n<p>Paper address: https:\/\/arxiv.org\/pdf\/2312.12436.pdf<\/p>\n<p>Project address: https:\/\/github.com\/BradyFU\/Awesome-Multimodal-Large-Language-Models<\/p>\n<p>However, the evaluation also revealed differences between the two. 
In the celebrity recognition task, GPT-4V scored 0, mainly because it refused to answer the related questions. In the location recognition task, both models performed poorly, indicating that they are insensitive to spatial location information. In addition, the open-source model SPHINX is on par with or even better than GPT-4V and Gemini in perception tasks, but lags well behind them in cognition.<\/p>\n<p>The evaluation report assesses Gemini-Pro&#039;s visual understanding capabilities in detail across four areas: basic perception, <span class=\"spamTxt\">advanced<\/span> cognition, challenging visual tasks, and various expert capabilities. The basic perception tests cover object-level, scene-level, and knowledge-based perception; here Gemini-Pro performed particularly well in tasks such as color\/landmark\/person recognition and OCR.<\/p>\n<p>The <span class=\"spamTxt\">advanced<\/span> cognition tests involve tasks such as text-rich visual reasoning, abstract visual reasoning, scientific problem solving, sentiment analysis, and intellectual games; Gemini-Pro achieved good results in formula generation and on abstract visual stimuli.<\/p>\n<p>The challenging visual tasks include referring expression understanding, object tracking, and visual story generation, in which Gemini-Pro demonstrated deep visual perception and understanding. Finally, the expert capability tests involve tasks such as defect detection and economic analysis, and Gemini-Pro showed strong expertise in analyzing stock price charts. However, the evaluation also pointed out that Gemini-Pro hallucinates in some tasks and needs further improvement.<\/p>\n<p>Gemini-Pro has achieved remarkable results in the multimodal field, demonstrating strong potential in visual understanding. However, the evaluation also highlights that there is still room for improvement in specific tasks and fields. 
The performance of Gemini-Pro demonstrates the potential power of multimodal technology and provides useful inspiration for future research and applications.<\/p>","protected":false},"excerpt":{"rendered":"<p>The recent Gemini-Pro review report shows that it has made significant progress in the multimodal domain, on par with the GPT-4V, and even outperforming it in some aspects. First, in the overall performance on the proprietary multimodal benchmark MME, Gemini-Pro surpassed GPT-4V with a high score of 1933.4, demonstrating comprehensive advantages in perception and cognition. And among the 37 visual understanding tasks, Gemini-Pro excelled in text translation, color\/landmark\/people recognition, and OCR, demonstrating its superior ability in the basic perceptual domain. Paper address:https:\/\/arxiv.org\/pdf\/2312.12436.pdf Project address:https:\/\/gith<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[741,592],"collection":[],"class_list":["post-2193","post","type-post","status-publish","format-standard","hentry","category-news","tag-gemini-pro","tag-592"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2193","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=2193"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2193\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=2193"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\
/wp\/v2\/categories?post=2193"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=2193"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=2193"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}