{"id":33877,"date":"2025-04-23T17:42:29","date_gmt":"2025-04-23T09:42:29","guid":{"rendered":"https:\/\/www.1ai.net\/?p=33877"},"modified":"2025-04-23T19:00:55","modified_gmt":"2025-04-23T11:00:55","slug":"%e8%8b%b1%e4%bc%9f%e8%be%be%e5%8f%91%e5%b8%83-eagle-2-5-%e8%a7%86%e8%a7%89%e8%af%ad%e8%a8%80ai%e6%a8%a1%e5%9e%8b%ef%bc%9a8b-%e5%8f%82%e6%95%b0%e5%aa%b2%e7%be%8e-gpt-4o","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/33877.html","title":{"rendered":"NVIDIA Releases Eagle 2.5 Visual Language AI Model: 8B Parameters Comparable to GPT-4o"},"content":{"rendered":"<p>April 23 - Technology media outlet marktechpost published a blog post yesterday, April 22, reporting that<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%8b%b1%e4%bc%9f%e8%be%be\" title=\"Look at the article with the label\" target=\"_blank\" >Nvidia<\/a>Newly Launched <a href=\"https:\/\/www.1ai.net\/en\/tag\/eagle-2-5\" title=\"[See articles with [Eagle 2.5] label]\" target=\"_blank\" >Eagle 2.5<\/a>, a visual-linguistic model (VLM) focused on long context multimodal learning.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-33878\" title=\"bee377adj00sv606e004cd000sg00jsp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/04\/bee377adj00sv606e004cd000sg00jsp.jpg\" alt=\"bee377adj00sv606e004cd000sg00jsp\" width=\"1024\" height=\"712\" \/><\/p>\n<p>The model focuses on understanding large-scale video and images, and is particularly good at processing high-resolution images and long video sequences. Despite having a parameter size of only 8B, Eagle 2.5 scored 72.4% on the Video-MME benchmark (512 frame input), which is comparable to much larger models such as Qwen2.5-VL-72B and InternVL2.5-78B.<\/p>\n<p><strong>Innovative training strategies<\/strong><\/p>\n<p>The success of Eagle 2.5 could not have been achieved without two key training strategies: Information-First Sampling and Progressive Post-Training.<\/p>\n<p>Information Priority Sampling preserves more than 60% of the original image area while reducing aspect ratio distortion through Image Area Preservation (IAP) technology, while Automatic Degradation Sampling (ADS) dynamically balances the visual and textual inputs based on contextual lengths, ensuring textual integrity and optimization of visual details.<\/p>\n<p>Progressive post-training gradually extends the model context window from 32K to 128K tokens, allowing the model to maintain stable performance under different input lengths and avoid overfitting a single context range. 
**Customized data sets**

Eagle 2.5's training data pipeline combines open-source resources with a customized dataset, Eagle-Video-110K, built for long-video understanding and annotated with a dual labeling scheme.

The top-down approach applies story-level segmentation, pairing human-annotated chapter metadata with dense descriptions generated by GPT-4; the bottom-up approach uses GPT-4o to generate question-answer pairs for short clips, capturing spatio-temporal details.

Filtered by cosine similarity, the dataset emphasizes diversity over redundancy while maintaining narrative coherence and fine-grained annotation, which significantly improves the model's performance on high-frame-count (≥128 frames) tasks.
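The article mentions cosine-similarity filtering but not the embedding model or the cutoff used, so the sketch below only illustrates the greedy diversity-filtering idea; the 0.9 threshold and the toy 2-D vectors are placeholders.

```python
# Rough sketch of cosine-similarity filtering for dataset diversity: keep a
# clip only if it is not too similar to any clip already kept. The threshold
# and embeddings are illustrative assumptions.

import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep a sample only if its cosine similarity to every
    already-kept sample stays below the threshold."""
    # Normalise rows so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(unit):
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Example: three near-duplicate vectors and one distinct vector.
vectors = np.array([[1.0, 0.0], [0.99, 0.01], [0.98, 0.02], [0.0, 1.0]])
print(diversity_filter(vectors))  # -> [0, 3]
```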
**Performance**

Eagle 2.5-8B performs well across both video and image understanding tasks. On video benchmarks it scores 74.8 on MVBench, 77.6 on MLVU, and 66.4 on LongVideoBench; on image benchmarks it scores 94.1 on DocVQA, 87.5 on ChartQA, and 80.4 on InfoVQA.

Ablation studies show that removing IAP or ADS degrades performance, while progressive training and the addition of the Eagle-Video-110K dataset yield consistent gains.

Reference links:

- Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models: https://arxiv.org/abs/2504.15271
- GitHub page: https://github.com/NVlabs/EAGLE
- Project page: https://nvlabs.github.io/EAGLE/