{"id":30909,"date":"2025-03-17T14:24:12","date_gmt":"2025-03-17T06:24:12","guid":{"rendered":"https:\/\/www.1ai.net\/?p=30909"},"modified":"2025-03-17T14:24:25","modified_gmt":"2025-03-17T06:24:25","slug":"%e5%8f%97-deepseek-r1-%e5%90%af%e5%8f%91%ef%bc%8c%e5%b0%8f%e7%b1%b3%e5%a4%a7%e6%a8%a1%e5%9e%8b%e5%9b%a2%e9%98%9f%e7%99%bb%e9%a1%b6%e9%9f%b3%e9%a2%91%e6%8e%a8%e7%90%86-mmau-%e6%a6%9c","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/30909.html","title":{"rendered":"Inspired by DeepSeek-R1, Xiaomi's Big Modeling Team Tops Audio Inference MMAU Chart"},"content":{"rendered":"<p>March 17, @<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%b0%8f%e7%b1%b3\" title=\"[View articles tagged with [Xiaomi]]\" target=\"_blank\" >Millet<\/a>Technology official microblogging today said that Xiaomi<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%a4%a7%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [large models]]\" target=\"_blank\" >Large Model<\/a>Team in<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%9f%b3%e9%a2%91%e6%8e%a8%e7%90%86\" title=\"[Sees articles with [Audience reasoning] labels]\" target=\"_blank\" >Audio Reasoning<\/a>The team has made breakthrough progress in the field of multimodal audio understanding. 
Inspired by DeepSeek-R1, the team was the first to apply reinforcement learning algorithms to multimodal audio understanding tasks, and in just one week it topped the internationally authoritative MMAU audio understanding benchmark with a SOTA accuracy of 64.5%; the work has now been open-sourced.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-30910\" title=\"3265c835j00st98fa001od000ho00d5p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/03\/3265c835j00st98fa001od000ho00d5p.jpg\" alt=\"3265c835j00st98fa001od000ho00d5p\" width=\"636\" height=\"473\" \/><\/p>\n<p>1AI attached the full official text below:<\/p>\n<p>Reinforcement Learning Demonstrates \"Counterintuitive\" Advantage -- Xiaomi's Large Model Team Tops the Audio Reasoning MMAU Leaderboard<\/p>\n<p>Faced with a recording from a car's cockpit while in motion, can AI determine whether the car has a potential malfunction? At a symphony performance, can AI infer the composer's mood when the music was written? Amid the chaotic footsteps of a subway station during the morning rush hour, can AI predict the risk of a collision at the gates? 
In the era of large models, people are no longer satisfied with machines merely recognizing the content of speech or the type of sound; they expect machines to be capable of complex reasoning.<\/p>\n<p>The MMAU (Massive Multi-Task Audio Understanding and Reasoning) benchmark (https:\/\/arxiv.org\/abs\/2410.19168) is a quantitative measure of this audio reasoning ability. It uses 10,000 audio samples covering speech, ambient sound, and music, combined with Q&amp;A pairs labeled by human experts, to test a model's performance on 27 skills such as cross-scene reasoning and domain expertise, with the expectation that the model achieves a level of logical analysis close to that of a human expert.<\/p>\n<p>As the benchmark's ceiling, human experts achieve an accuracy of 82.23% on MMAU. It is a difficult evaluation set: the best-performing model on the current official MMAU leaderboard is OpenAI's GPT-4o, with an accuracy of 57.3%, followed closely by Google DeepMind's Gemini 2.0 Flash at 55.6%.<\/p>\n<p>Alibaba's Qwen2-Audio-7B model has an accuracy of 49.2% on this benchmark. Because it is open source, we attempted to fine-tune it using a smaller dataset, the AVQA dataset released by Tsinghua University (https:\/\/mn.cs.tsinghua.edu.cn\/avqa\/). AVQA contains only 38,000 training samples, and with full supervised fine-tuning (SFT) the model's accuracy on MMAU improved to 51.8%. This is not a particularly significant gain.<\/p>\n<p>The release of DeepSeek-R1 inspired our research on this task. 
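The accuracy figures above can be reproduced, in spirit, by a simple multiple-choice scoring loop. The sketch below is illustrative only, assuming a plain choice-matching protocol; it is not the official MMAU evaluation harness, and the function and variable names are hypothetical.

```python
def mmau_accuracy(predictions, references):
    """Fraction of questions whose predicted choice matches the expert label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy example: 4 audio questions, the model gets 3 of them right.
preds = ["A", "C", "B", "B"]
refs = ["A", "C", "D", "B"]
score = mmau_accuracy(preds, refs)
print(f"{score:.1%}")  # 75.0%
```

Under this kind of protocol, the reported 64.5% would simply mean that the model's chosen answer matched the expert label on 64.5% of the benchmark's questions.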
DeepSeek-R1\u2019s Group Relative Policy Optimization (GRPO) approach lets a model evolve autonomously through a simple \u201ccheck-and-reward\u201d mechanism, developing human-like reasoning behaviors such as reflection and multi-step verification. At the same time, a preprint from Carnegie Mellon University, \u201cAll Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning\u201d (https:\/\/arxiv.org\/abs\/2503.01067), drew an interesting conclusion from a carefully designed experiment: when a task has a clear generation-validation gap, that is, when generating a result is much harder than verifying its validity, reinforcement learning holds a unique advantage over supervised fine-tuning.<strong> And audio question answering (AQA) happens to be exactly such a task, with a significant generation-validation gap.<\/strong><\/p>\n<p>As an analogy, offline fine-tuning methods such as SFT are a bit like memorizing a question bank: you can only train on existing questions and answers, and you may not be able to solve new questions you encounter. Reinforcement learning methods such as GRPO are like a teacher asking you to come up with several answers and then telling you which one is good, so that you actively think and develop your own ability rather than being spoon-fed. Of course, given enough training, a student willing to spend years memorizing the question bank may eventually achieve good results, but the efficiency is too low and too much time is wasted. Active thinking, by contrast, is more likely to quickly achieve the effect of learning by analogy. 
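The group-relative mechanism described above can be sketched in a few lines. This is a minimal illustration under assumed details, not Xiaomi's actual training code: GRPO samples several candidate answers per question, scores each with a simple verifiable reward, and normalizes rewards within the group so that better-than-average answers receive positive advantages.

```python
# Hypothetical sketch of GRPO's group-relative "check-and-reward" idea.
# All names are illustrative; real implementations compute a policy-gradient
# loss from these advantages, which is omitted here.

def correctness_reward(candidate: str, reference: str) -> float:
    """1.0 if the sampled answer matches the labeled answer, else 0.0."""
    return 1.0 if candidate.strip().lower() == reference.strip().lower() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: reward minus group mean, divided by group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all candidates equally good or bad: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one audio question, labeled "dog barking".
candidates = ["dog barking", "car horn", "dog barking", "rainfall"]
rewards = [correctness_reward(c, "dog barking") for c in candidates]
advantages = group_relative_advantages(rewards)  # [1.0, -1.0, 1.0, -1.0]
```

Note that the reward here is trivially verifiable (an exact answer check), while generating a correct answer from raw audio is hard; that asymmetry is precisely the generation-validation gap the CMU preprint identifies.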
Real-time feedback from reinforcement learning may help the model lock onto the high-quality regions of the answer distribution more quickly, whereas offline methods must traverse the entire space of possibilities, which is far less efficient.<\/p>\n<p>Based on the above insights, <strong>we tried migrating DeepSeek-R1\u2019s GRPO algorithm to the Qwen2-Audio-7B model<\/strong>. Surprisingly, with only the 38,000 training samples from AVQA, <strong>the reinforcement-learning fine-tuned model achieves an accuracy of 64.5% on the MMAU benchmark, nearly 10 percentage points better than GPT-4o, the commercial closed-source model currently ranked first on the leaderboard.<\/strong><\/p>\n<p>Interestingly, when we forced the model to output its reasoning process during training (similar to the traditional chain-of-thought approach), accuracy instead dropped to 61.1%. This suggests that explicit chain-of-thought output may not be conducive to model training.<\/p>\n<p>Our experiments reveal several conclusions that run counter to conventional wisdom:<\/p>\n<ul>\n<li>On fine-tuning methods: reinforcement learning on a 38,000-sample dataset significantly outperforms supervised learning on a 570,000-sample dataset<\/li>\n<li>On parameter scale: compared with models of hundreds of billions of parameters, a 7B-parameter model can also show strong reasoning ability through reinforcement learning<\/li>\n<li>On implicit reasoning: explicit chain-of-thought output instead becomes a performance bottleneck<\/li>\n<\/ul>\n<p>Although the current accuracy has exceeded 64%, it is still far from the 82% level of human experts. 
In our current experiments, the reinforcement learning strategy is still relatively rough, and the training process offers insufficient chain-of-thought guidance; we will explore these directions further in follow-up work.<\/p>\n<p>The experiments validate the unique value of reinforcement learning in the field of audio reasoning and open a new door for subsequent research. When a machine not only \"hears\" sound but also \"understands\" the causal logic behind it, the true era of intelligent hearing will arrive.<\/p>\n<p>We have open-sourced the training code and model parameters and provided a technical report for reference and exchange in academia and industry.<\/p>\n<p data-vmark=\"18a7\">Training code:<a href=\"https:\/\/github.com\/xiaomi-research\/r1-aqa\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/github.com\/xiaomi-research\/r1-aqa<\/span><\/a><\/p>\n<p data-vmark=\"d923\">Model parameters:<a href=\"https:\/\/huggingface.co\/mispeech\/r1-aqa\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/huggingface.co\/mispeech\/r1-aqa<\/span><\/a><\/p>\n<p data-vmark=\"96d7\">Technical report:<a href=\"https:\/\/arxiv.org\/abs\/2503.11197\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/arxiv.org\/abs\/2503.11197<\/span><\/a><\/p>\n<p data-vmark=\"7a76\">Interaction Demo:<a href=\"http:\/\/120.48.108.147:7860\/\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">http:\/\/120.48.108.147:7860\/<\/span><\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>March 17 news: the official @Xiaomi Technology Weibo account said today that the Xiaomi large model team has made breakthrough progress in the field of audio reasoning. 
Inspired by DeepSeek-R1, the team took the lead in applying reinforcement learning algorithms to multimodal audio understanding tasks, and in just one week it topped the internationally authoritative MMAU audio understanding benchmark with a SOTA accuracy of 64.5%, which has now been open-sourced. 1AI attached the official text as follows: Reinforcement Learning Demonstrates \"Counterintuitive\" Advantage -- Xiaomi's Large Model Team Tops the Audio Reasoning MMAU Leaderboard. Faced with a recording from a car's cockpit while in motion, can AI determine whether the car has a potential malfunction? At a symphony performance, can AI infer the composer's mood when the music was written?<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[216,1114,5997],"collection":[],"class_list":["post-30909","post","type-post","status-publish","format-standard","hentry","category-news","tag-216","tag-1114","tag-5997"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/30909","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=30909"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/30909\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=30909"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=30909"},{"taxonomy":"post_tag","embeddable":true,"href":"https
:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=30909"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=30909"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}