{"id":3429,"date":"2024-01-31T09:31:09","date_gmt":"2024-01-31T01:31:09","guid":{"rendered":"https:\/\/www.1ai.net\/?p=3429"},"modified":"2024-01-31T09:31:09","modified_gmt":"2024-01-31T01:31:09","slug":"%e5%be%ae%e8%bd%af%e5%bc%80%e6%ba%90%e5%a4%9a%e6%a8%a1%e6%80%81%e6%a8%a1%e5%9e%8bllava-1-5%e5%aa%b2%e7%be%8egpt-4v%e6%95%88%e6%9e%9c","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/3429.html","title":{"rendered":"Microsoft&#039;s open source multimodal model LLaVA-1.5 is comparable to GPT-4V"},"content":{"rendered":"<p><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%be%ae%e8%bd%af\" title=\"[View articles tagged with [Microsoft]]\" target=\"_blank\" >Microsoft<\/a> has <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open-sourced<\/a> the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [multimodal model]]\" target=\"_blank\" >multimodal model<\/a> <a href=\"https:\/\/www.1ai.net\/en\/tag\/llava-1-5\" title=\"_Other Organiser\" target=\"_blank\" >LLaVA-1.5<\/a>, which inherits the LLaVA architecture and introduces new features. The researchers tested it on visual question answering, natural language processing, image generation, and other tasks, and found that LLaVA-1.5 reached the <span class=\"spamTxt\">highest<\/span> level among open-source models, comparable to GPT-4V.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3430\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/6384228852298686258242828.png\" alt=\"\" width=\"577\" height=\"583\" \/><\/p>\n<p>The model consists of three parts: a visual model, a large language model, and a visual language connector. The visual model uses the pre-trained CLIP ViT-L\/336px. 
CLIP encoding produces a fixed-length vector representation that better captures an image's semantic content. Compared with the previous version, both the CLIP model's parameter count and its input resolution have increased significantly.<\/p>\n<p>The large language model is Vicuna v1.5, a 13-billion-parameter model that understands user input text, captures its semantics, and offers strong reasoning and generation capabilities. Unlike methods that tune only the image encoder, LLaVA-1.5 also updates the large language model's parameters during training, allowing it to learn directly how to integrate visual information into its reasoning and improving the model's autonomy.<\/p>\n<p>For the visual language connector, LLaVA-1.5 replaces the linear projection with a two-layer MLP, which maps the CLIP encoder's output into the large language model's word embedding space more effectively.<\/p>\n<p>For training, LLaVA-1.5 follows a two-stage procedure. First, vision-language representation pre-training runs on about 600,000 image-text pairs and takes roughly one hour. Then, instruction tuning runs on about 650,000 multimodal instruction examples and takes roughly 20 hours. This efficient two-stage scheme ensures the model converges and completes the entire process within a day, greatly reducing compute and time costs compared with other models.<\/p>\n<p>The researchers also designed matching response-format prompts that guide the model to adapt its output format to the interaction type, meeting the needs of specific scenarios. 
For visual instruction tuning, LLaVA-1.5 draws on several dataset types, including VQA, OCR, region-level VQA, visual dialogue, and language dialogue, totaling about 650,000 examples, which give the model rich visual-scene reasoning and interaction abilities.<\/p>\n<p>LLaVA-1.5 marks significant progress in the multimodal field, and its open-source release has promoted wider adoption in visual question answering, natural language processing, image generation, and other areas.<\/p>","protected":false},"excerpt":{"rendered":"<p>Microsoft has open-sourced the multimodal model LLaVA-1.5, inheriting the LLaVA architecture and introducing new features. Researchers tested it in visual question answering, natural language processing, image generation, and other tasks, showing that LLaVA-1.5 reached the highest level among open-source models, comparable to GPT-4V. The model consists of three major parts: the visual model, the large language model, and the visual language connector. The visual model uses the pre-trained CLIP ViT-L\/336px; CLIP encoding yields a fixed-length vector representation that enhances the image's semantic representation. Compared with the previous version, the CLIP model parameters and input resolution are significantly improved. 
The large language model adopts Vicuna v1.5 with 13 billion parameters for understanding the semantic information represented with the<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[1097,1096,219,280],"collection":[],"class_list":["post-3429","post","type-post","status-publish","format-standard","hentry","category-news","tag-llava-1-5","tag-1096","tag-219","tag-280"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/3429","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=3429"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/3429\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=3429"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=3429"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=3429"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=3429"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}