{"id":15798,"date":"2024-07-18T08:42:57","date_gmt":"2024-07-18T00:42:57","guid":{"rendered":"https:\/\/www.1ai.net\/?p=15798"},"modified":"2024-07-18T08:42:57","modified_gmt":"2024-07-18T00:42:57","slug":"%e6%99%ba%e6%ba%90%e7%a0%94%e7%a9%b6%e9%99%a2%e6%8e%a8%e5%87%ba%e6%96%b0%e4%b8%80%e4%bb%a3%e6%97%a0%e7%bc%96%e7%a0%81%e5%99%a8%e8%a7%86%e8%a7%89%e8%af%ad%e8%a8%80%e5%a4%9a%e6%a8%a1%e6%80%81%e5%a4%a7","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/15798.html","title":{"rendered":"Beijing Academy of Artificial Intelligence (BAAI) launches EVE, a new-generation encoder-free vision-language multimodal large model"},"content":{"rendered":"<p>Recently, research on and applications of <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81%e5%a4%a7%e6%a8%a1%e5%9e%8b\" title=\"[See articles with the [Multimodal Large Model] tag]\" target=\"_blank\" >multimodal large models<\/a> have made significant progress. Foreign companies such as OpenAI, Google, and Microsoft have launched a series of advanced models, and Chinese institutions such as Zhipu AI and StepFun (Jieyuexingchen) have also made breakthroughs in this field. These models usually rely on a visual encoder to extract visual features that are then combined with a large language model, but because the two components are trained separately, the encoder introduces a visual inductive bias that limits the deployment efficiency and performance of multimodal large models.<\/p>\n<p>To address these problems, the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%99%ba%e6%ba%90%e7%a0%94%e7%a9%b6%e9%99%a2\" title=\"[See articles with the [BAAI] tag]\" target=\"_blank\" >Beijing Academy of Artificial Intelligence (BAAI)<\/a>, in collaboration with Dalian University of Technology, Peking University, and other universities, has launched EVE, a new-generation encoder-free vision-language model. Through refined training strategies and additional visual supervision, EVE integrates vision-language representation, alignment, and reasoning into a unified decoder-only architecture. 
Using public data, EVE performs well on multiple vision-language benchmarks, approaching or even outperforming mainstream encoder-based multimodal methods.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15799\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/6385682080913376559203452.jpg\" alt=\"\" width=\"1000\" height=\"359\" \/><\/p>\n<p>The main features of EVE include:<\/p>\n<ul>\n<li>Native vision-language model: removes the visual encoder, handles arbitrary image aspect ratios, and significantly outperforms the comparable Fuyu-8B model.<\/li>\n<li>Low data and training cost: pre-training uses public data such as OpenImages, SAM, and LAION, and the training time is short.<\/li>\n<li>Transparent and efficient exploration: provides an efficient, transparent development path for decoder-only native multimodal architectures.<\/li>\n<\/ul>\n<p>Model structure:<\/p>\n<ul>\n<li>Patch Embedding Layer: obtains a 2D feature map of the image through a single convolution layer followed by an average pooling layer, enhancing both local features and global information.<\/li>\n<li>Patch Aligning Layer: integrates visual features from multiple network layers to achieve fine-grained alignment with the output of a visual encoder.<\/li>\n<\/ul>\n<p>Training strategy:<\/p>\n<ul>\n<li>Large-language-model-guided pre-training phase: establishes an initial connection between vision and language.<\/li>\n<li>Generative pre-training phase: improves the model's ability to understand vision-language content.<\/li>\n<li>Supervised fine-tuning phase: regularizes the model's ability to follow language instructions and learn conversational patterns.<\/li>\n<\/ul>\n<p>Quantitative analysis: EVE performs well on multiple vision-language benchmarks and is comparable to a variety of mainstream encoder-based vision-language models. 
Although it still faces challenges in responding accurately to specific instructions, EVE's efficient training strategy enables performance comparable to that of encoder-based vision-language models.<\/p>\n<p>EVE demonstrates the potential of encoder-free native vision-language models, and further performance improvements, optimization of the encoder-free architecture, and the construction of native multimodality may continue to advance the development of multimodal models.<\/p>\n<p>Paper address: https:\/\/arxiv.org\/abs\/2406.11832<\/p>\n<p>Project code: https:\/\/github.com\/baaivision\/EVE<\/p>\n<p>Model address: https:\/\/huggingface.co\/BAAI\/EVE-7B-HD-v1.0<\/p>","protected":false},"excerpt":{"rendered":"<p>Recently, the research and application of multimodal large models have made significant progress. Foreign companies such as OpenAI, Google, and Microsoft have introduced a series of advanced models, and Chinese institutions such as Zhipu AI and StepFun have made breakthroughs in this field. These models usually rely on visual encoders to extract visual features and combine them with large language models, but suffer from a visual inductive bias due to separate training, which limits the deployment efficiency and performance of multimodal large models. To address these issues, the Beijing Academy of Artificial Intelligence (BAAI), in conjunction with Dalian University of Technology and Peking University, has introduced EVE, a next-generation encoder-free vision-language model. EVE integrates vision-language representation, alignment, and reasoning into a unified decoder-only architecture through a refined training strategy and additional visual supervision. 
Using the public<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[2258,602,1739],"collection":[],"class_list":["post-15798","post","type-post","status-publish","format-standard","hentry","category-news","tag-eve","tag-602","tag-1739"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15798","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=15798"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15798\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=15798"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=15798"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=15798"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=15798"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}