{"id":23415,"date":"2024-11-19T21:28:32","date_gmt":"2024-11-19T13:28:32","guid":{"rendered":"https:\/\/www.1ai.net\/?p=23415"},"modified":"2024-11-19T21:28:32","modified_gmt":"2024-11-19T13:28:32","slug":"%e5%8c%97%e5%a4%a7%e6%b8%85%e5%8d%8e%e7%ad%89%e8%81%94%e5%90%88%e5%8f%91%e5%b8%83-llava-o1%ef%bc%9a%e9%a6%96%e4%b8%aa%e8%87%aa%e5%8f%91%e6%80%a7%e8%a7%86%e8%a7%89ai%e6%a8%a1%e5%9e%8b%ef%bc%8c%e6%8e%a8","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/23415.html","title":{"rendered":"Peking University, Tsinghua University and others jointly release LLaVA-o1: the first spontaneous visual AI model, a new approach to inference-time compute scaling"},"content":{"rendered":"<p>Nov. 19 - A research team from Peking University, Tsinghua University, Pengcheng Laboratory, Alibaba DAMO Academy, and Lehigh University has <strong>released <a href=\"https:\/\/www.1ai.net\/en\/tag\/llava-o1\" title=\"View articles tagged LLaVA-o1\" target=\"_blank\" >LLaVA-o1<\/a>, the first <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%a7%86%e8%a7%89%e8%af%ad%e8%a8%80%e6%a8%a1%e5%9e%8b\" title=\"View articles tagged visual language model\" target=\"_blank\" >visual language model<\/a> capable of spontaneous, GPT-o1-style systematic reasoning (\"spontaneous\" is explained at the end of this article).<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-23416\" title=\"6bdc8cbej00sn79ca002ud000v90071p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/11\/6bdc8cbej00sn79ca002ud000v90071p.jpg\" alt=\"6bdc8cbej00sn79ca002ud000v90071p\" width=\"1125\" height=\"253\" \/><\/p>\n<p>LLaVA-o1 is a novel visual language model (<a href=\"https:\/\/www.1ai.net\/en\/tag\/vlm\" title=\"View articles tagged VLM\" target=\"_blank\" >VLM<\/a>) designed to perform autonomous multi-stage reasoning.<\/p>\n<p>LLaVA-o1, with 11 billion parameters, was developed from the Llama-3.2-Vision-Instruct model and is structured around four reasoning 
stages: summary, caption, reasoning, and conclusion.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-23417\" title=\"426f6d7fj00sn79cu0082d000v900q1p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/11\/426f6d7fj00sn79cu0082d000v900q1p.jpg\" alt=\"426f6d7fj00sn79cu0082d000v900q1p\" width=\"1125\" height=\"937\" \/><\/p>\n<p>The model is fine-tuned on a dataset called LLaVA-o1-100k, built from visual question answering (VQA) sources with structured reasoning annotations generated by GPT-4o.<\/p>\n<p>LLaVA-o1 employs stage-level beam search, an inference-time scaling technique that generates multiple candidate answers at each reasoning stage and selects the best one.<\/p>\n<p>This gives the model a strong ability to handle complex tasks, overcoming the limitations of traditional visual language models on complex visual question answering tasks.<\/p>\n<p>Compared to its base model, LLaVA-o1 improves performance by 8.9% on multimodal reasoning benchmarks, outperforming many larger and closed-source competitors.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-23418\" title=\"df97ae20j00sn79dv003md000v90068p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/11\/df97ae20j00sn79dv003md000v90068p.jpg\" alt=\"df97ae20j00sn79dv003md000v90068p\" width=\"1125\" height=\"224\" \/><\/p>\n<p>The introduction of LLaVA-o1 fills an important gap between text-based and visual question answering models. Its strong performance across several benchmarks, especially on visual reasoning problems in math and science, demonstrates the importance of structured reasoning in visual language models.<\/p>\n<p>Spontaneous AI refers to AI systems that can mimic the spontaneous behavior of animals. 
Research in this area focuses on using machine learning and complex temporal patterns to design robots or intelligent systems that exhibit spontaneous behavior.<\/p>\n<p data-vmark=\"9e52\"><strong>References<\/strong><\/p>\n<ul class=\"custom_reference list-paddingleft-1\">\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"795e\"><a href=\"https:\/\/www.bilibili.com\/video\/BV1TfUpYkEQg\/?spm_id_from=333.337.search-card.all.click&amp;vd_source=321d627018508d62cc1e20ef0f2c4ba4\" target=\"_blank\" rel=\"noopener\">Beihang University Releases Multimodal Large Model LLaVA-o1, a New Idea for Scaling Inference Computing<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"4799\"><a href=\"https:\/\/www.marktechpost.com\/2024\/11\/18\/meet-llava-o1-the-first-visual-language-model-capable-of-spontaneous-systematic-reasoning-similar-to-gpt-o1\/\" target=\"_blank\" rel=\"noopener\">Meet LLaVA-o1: The First Visual Language Model Capable of Spontaneous, Systematic Reasoning Similar to GPT-o1<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"31c6\"><a href=\"https:\/\/arxiv.org\/abs\/2411.10440\" target=\"_blank\" rel=\"noopener\">LLaVA-o1: Let Vision Language Models Reason Step-by-Step<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"0605\"><a href=\"https:\/\/github.com\/PKU-YuanGroup\/LLaVA-o1\" target=\"_blank\" rel=\"noopener\">Github<\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Nov. 19 - A team of researchers from Peking University, Tsinghua University, Pengcheng Laboratory, Alibaba DAMO Academy, and Lehigh University has introduced LLaVA-o1, the first visual language model capable of spontaneous, GPT-o1-style systematic reasoning. 
LLaVA-o1 is a novel visual language model (VLM) designed for autonomous multi-stage reasoning. With 11 billion parameters, LLaVA-o1 is based on the Llama-3.2-Vision-Instruct model and is structured around four reasoning stages: summary, caption, reasoning, and conclusion.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[167,4980,4982,4981],"collection":[],"class_list":["post-23415","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-llava-o1","tag-vlm","tag-4981"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/23415","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=23415"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/23415\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=23415"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=23415"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=23415"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=23415"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
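The stage-level beam search the article describes, sampling several candidates at each of the four reasoning stages and keeping only the best before moving on, can be sketched roughly as follows. This is a minimal illustration, not the actual LLaVA-o1 implementation: the `generate` and `score` callables (and the toy stand-ins that make it runnable) are hypothetical; the real model uses the VLM itself to sample candidates and judge them.

```python
import itertools

# The four reasoning stages used by LLaVA-o1 (from the article).
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def stage_level_beam_search(generate, score, n_candidates=4):
    """At each stage, sample `n_candidates` continuations, keep only the
    best-scoring one, and carry it forward as context for the next stage."""
    context = []
    for stage in STAGES:
        candidates = [generate(context, stage) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: score(context, stage, c))
        context.append((stage, best))
    return context

# Toy stand-ins so the sketch runs end to end: the "generator" emits numbered
# drafts and the "judge" prefers the highest draft number.
_counter = itertools.count()

def toy_generate(context, stage):
    return f"{stage}-draft-{next(_counter) % 4}"

def toy_score(context, stage, candidate):
    return int(candidate.rsplit("-", 1)[1])

result = stage_level_beam_search(toy_generate, toy_score)
print([text for _, text in result])
# prints ['summary-draft-3', 'caption-draft-3', 'reasoning-draft-3', 'conclusion-draft-3']
```

The key design point is that pruning happens per stage rather than per token or per full answer, which is what lets the model spend extra inference compute exactly where a stage (e.g. the reasoning step) is most likely to go wrong.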