{"id":2727,"date":"2024-01-10T09:25:14","date_gmt":"2024-01-10T01:25:14","guid":{"rendered":"https:\/\/www.1ai.net\/?p=2727"},"modified":"2024-01-10T09:25:14","modified_gmt":"2024-01-10T01:25:14","slug":"%e5%8f%82%e6%95%b0%e5%b0%8f%ef%bc%8c%e6%80%a7%e8%83%bd%e5%bc%ba%ef%bc%81%e5%bc%80%e6%ba%90%e5%a4%9a%e6%a8%a1%e6%80%81%e6%a8%a1%e5%9e%8b-tinygpt-v","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/2727.html","title":{"rendered":"Small parameters, strong performance! Open source multimodal model - TinyGPT-V"},"content":{"rendered":"<p>Researchers from Anhui University of Technology, Nanyang Technological University, and Lehigh University have <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open-sourced<\/a> a <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%9a%e6%a8%a1%e6%80%81%e5%a4%a7%e6%a8%a1%e5%9e%8b\" title=\"[Sees articles with [Multimodal Large Model] labels]\" target=\"_blank\" >multimodal large model<\/a>, <a href=\"https:\/\/www.1ai.net\/en\/tag\/tinygpt-v\" title=\"_Other Organiser\" target=\"_blank\" >TinyGPT-V<\/a>.<\/p>\n<p>TinyGPT-V uses Microsoft&#039;s open-source Phi-2 as its base large language model, together with the visual model EVA, to achieve multimodal capabilities.<strong>Although TinyGPT-V has only 2.8 billion parameters, its performance is comparable to models with tens of billions of parameters.<\/strong><\/p>\n<p>In addition, TinyGPT-V can be trained on a single 24GB GPU; it does not require high-end graphics cards such as the A100 or H100.<\/p>\n<p>It is therefore well suited to small and medium-sized enterprises and individual developers, and it can be deployed on mobile devices such as phones and laptops.<\/p>\n<p>Open source address: https:\/\/github.com\/DLYuanGod\/TinyGPT-V<\/p>\n<p>Paper address: https:\/\/arxiv.org\/abs\/2312.16862<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" 
decoding=\"async\" class=\"alignnone size-full wp-image-2729\" title=\"2024011008592022540\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/2024011008592022540.jpg\" alt=\"2024011008592022540\" width=\"554\" height=\"235\" \/><\/p>\n<p><strong>TinyGPT-V main architecture<\/strong><\/p>\n<p>TinyGPT-V consists of three main parts: the large language model Phi-2, a visual encoder, and linear projection layers.<\/p>\n<p>The developers chose Microsoft&#039;s newly open-sourced Phi-2 as TinyGPT-V&#039;s base large language model. Phi-2 has only 2.7 billion parameters, yet its understanding and reasoning capabilities are very strong: on many complex benchmarks it achieves results close to, or exceeding, those of models with 13 billion parameters.<\/p>\n<p>The visual encoder uses the same architecture as MiniGPT-v2: the ViT-based EVA model, a pre-trained visual foundation model that remains frozen throughout TinyGPT-V&#039;s training.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2730\" title=\"2024011008592022551\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/2024011008592022551.jpg\" alt=\"2024011008592022551\" width=\"554\" height=\"372\" \/><\/p>\n<p><strong>The role of the linear projection layers is to embed the image features extracted by the visual encoder into the large language model so that it can understand the image information.<\/strong><\/p>\n<p>TinyGPT-V&#039;s first linear projection layer adopts the Q-Former structure from BLIP-2, which maximizes reuse of BLIP-2&#039;s pre-training results.<\/p>\n<p>The second linear projection layer is initialized from a Gaussian distribution to bridge the dimensionality gap between the output of the previous layer and the language model embedding 
layer.<\/p>\n<p><strong>TinyGPT-V training process<\/strong><\/p>\n<p>TinyGPT-V was trained in four stages, each using different datasets and experimental procedures.<\/p>\n<p><strong>The first stage is warm-up training<\/strong>, whose purpose is to adapt the Phi-2 model to image input. It uses three datasets: Conceptual Caption, SBU, and LAION, totaling about 5 million images with corresponding descriptive texts.<\/p>\n<p><strong>The second stage is pre-training<\/strong>, whose purpose is to further reduce the loss on image-text pairs. It reuses the Conceptual Caption, SBU, and LAION datasets from the first stage and is divided into four sub-stages of 5,000 iterations each.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2728\" title=\"2024011008592022552\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/2024011008592022552.jpg\" alt=\"2024011008592022552\" width=\"554\" height=\"405\" \/><\/p>\n<p><strong>The third stage is instruction tuning<\/strong>, which trains the model on instruction-style image-text pairs from MiniGPT-4 and LLaVA, such as &quot;Describe the content of this picture.&quot;<\/p>\n<p><strong>The fourth stage is multi-task tuning<\/strong>, which uses more complex and richer multimodal datasets, such as sentences with complex semantic alignment from LLaVA, the object-parsing data from Flickr30K, a multi-task mixed corpus, and plain-text corpora.<\/p>\n<p>This stage adopts a learning-rate strategy similar to that of the second stage, eventually reducing the loss from 2.720 to 1.399.<\/p>\n<p>To test TinyGPT-V&#039;s performance, the researchers evaluated it on multiple vision-language tasks, including visual question answering, visual-spatial reasoning, and image caption 
generation.<\/p>\n<p class=\"article-content__img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2731\" title=\"2024011008592022553\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/01\/2024011008592022553.jpg\" alt=\"2024011008592022553\" width=\"554\" height=\"217\" \/><\/p>\n<p>The results show that although TinyGPT-V has few parameters, its performance is very strong. For example, on the VSR spatial-reasoning task it surpassed all tested models with an accuracy of 53.2%.<\/p>","protected":false},"excerpt":{"rendered":"<p>Researchers from Anhui University of Technology, Nanyang Technological University, and Lehigh University have open-sourced a multimodal large model, TinyGPT-V. TinyGPT-V uses Microsoft&#039;s open-source Phi-2 as its base language model, together with the visual model EVA, to achieve multimodal capabilities. Although TinyGPT-V has only 2.8 billion parameters, its performance is comparable to models with tens of billions of parameters. Furthermore, TinyGPT-V can be trained on a single 24GB GPU and does not require A100 or H100 high-end graphics cards. It is therefore well suited to small and medium-sized enterprises and individual developers, and can be deployed on mobile devices such as phones and laptops. 
Open Source Address: https:\/\/github.com\/DLYuanGod\/T<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[881,602,219],"collection":[],"class_list":["post-2727","post","type-post","status-publish","format-standard","hentry","category-news","tag-tinygpt-v","tag-602","tag-219"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2727","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=2727"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/2727\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=2727"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=2727"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=2727"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=2727"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}