{"id":42098,"date":"2025-09-01T16:55:18","date_gmt":"2025-09-01T08:55:18","guid":{"rendered":"https:\/\/www.1ai.net\/?p=42098"},"modified":"2025-09-01T16:55:18","modified_gmt":"2025-09-01T08:55:18","slug":"%e9%98%b6%e8%b7%83%e6%98%9f%e8%be%b0%e5%8f%91%e5%b8%83%e7%ab%af%e5%88%b0%e7%ab%af%e8%af%ad%e9%9f%b3%e5%a4%a7%e6%a8%a1%e5%9e%8b-step-audio-2-mini%ef%bc%8c%e5%a4%9a%e4%b8%aa%e5%9f%ba%e5%87%86%e6%b5%8b","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/42098.html","title":{"rendered":"Step Star releases Step-Audio 2 mini, a large end-to-end voice model, with SOTA scores in multiple benchmarks"},"content":{"rendered":"<p>September 1 news: <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%98%b6%e8%b7%83%e6%98%9f%e8%be%b0\" title=\"[View articles tagged with [Step Star]]\" target=\"_blank\" >Step Star<\/a> today open-sourced the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%ab%af%e5%88%b0%e7%ab%af\" title=\"[View articles tagged with [end-to-end]]\" target=\"_blank\" >end-to-end<\/a> <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%af%ad%e9%9f%b3%e5%a4%a7%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [Voice Megamodel]]\" target=\"_blank\" >voice large model<\/a> Step-Audio 2 mini. The model has achieved SOTA scores on several international benchmark test sets and is now available on the Step-Star Open Platform.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-42099\" title=\"2b249555j00t1wjex003zd000v900fkp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/09\/2b249555j00t1wjex003zd000v900fkp.jpg\" alt=\"2b249555j00t1wjex003zd000v900fkp\" width=\"1125\" height=\"560\" \/><\/p>\n<p>1AI learned from the official introduction that the model unifies speech understanding, audio reasoning, and generation, and is the first to support voice-native Tool Calling, which <strong>enables web search and other operations<\/strong>.<\/p>\n<p>Step-Audio 2 mini achieves SOTA scores in several key benchmarks, excelling in audio understanding, speech recognition, translation, and dialog scenarios. <strong>It outperforms all open-source end-to-end speech models, including Qwen-Omni and Kimi-Audio, in comprehensive performance<\/strong>, and surpasses GPT-4o Audio on most tasks.<\/p>\n<ul>\n<li>Step-Audio 2 mini topped the open-source end-to-end speech models with a score of 73.2 on MMAU, a general multimodal audio understanding test set;<\/li>\n<li>On URO Bench, which measures spoken conversation ability, Step-Audio 2 mini earned the highest scores among open-source end-to-end speech models on both the basic and professional tracks, demonstrating excellent conversational comprehension and expression;<\/li>\n<li>On the Chinese-English translation task, Step-Audio 2 mini has a clear advantage, scoring 39.3 and 29.1 on the CoVoST 2 and CVSS evaluation sets respectively, significantly ahead of GPT-4o Audio and other open-source speech models;<\/li>\n<li>In the speech recognition task, Step-Audio 2 mini ranks first across multiple languages and dialects, with an average CER (Character Error Rate) of 3.19 on open-source Chinese test sets and an average WER (Word Error Rate) of 3.50 on open-source English test sets, more than 15% ahead of other open-source models.<\/li>\n<\/ul>\n<p>In the past, AI voice systems were often criticized for low IQ and low emotional intelligence: first, they were \"ignorant\", lacking the knowledge reserve and reasoning ability of large text models; second, they were \"cold\", unable to pick up on subtext, tone of voice, emotion, laughter, and other cues beyond the literal words.
Step-Audio 2 mini effectively solves these problems of previous speech models through innovative architectural design.<\/p>\n<ul>\n<li><strong>True end-to-end multimodal architecture:<\/strong> Step-Audio 2 mini breaks through the traditional three-stage ASR + LLM + TTS structure and converts raw audio input directly into voice response output, with a simpler architecture, lower latency, and effective understanding of paralinguistic information and non-vocal signals.<\/li>\n<li><strong>CoT reasoning combined with reinforcement learning:<\/strong> Step-Audio 2 mini is the first end-to-end speech model to co-optimize Chain-of-Thought (CoT) reasoning with reinforcement learning, enabling fine-grained understanding, reasoning, and natural responses to paralinguistic and non-speech signals such as emotion, intonation, and music.<\/li>\n<li><strong>Audio knowledge enhancement:<\/strong> The model supports external tools such as web retrieval, which helps mitigate hallucinations and extends the model to more scenarios.<\/li>\n<\/ul>\n<p>GitHub: https:\/\/github.com\/stepfun-ai\/Step-Audio2<\/p>\n<p>Hugging Face: https:\/\/huggingface.co\/stepfun-ai\/Step-Audio-2-mini<\/p>\n<p>ModelScope: https:\/\/www.modelscope.cn\/models\/stepfun-ai\/Step-Audio-2-mini<\/p>","protected":false},"excerpt":{"rendered":"<p>September 1 news: Step Star today open-sourced Step-Audio 2 mini, an end-to-end speech model that has achieved SOTA on multiple international benchmark test sets and is now available on the Step-Star Open Platform. 1AI learned from the official introduction that Step-Audio 2 mini unifies speech understanding, audio reasoning, and generation, and is the first to support voice-native Tool Calling, which enables web search and other operations. Step-Audio 2 mini has achieved SOTA in several key benchmarks, with outstanding performance in audio understanding, speech recognition, translation, and dialog scenarios, and its comprehensive performance surpasses that of Qwen-Omni and Kimi-Audio.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[6126,4061,1893],"collection":[],"class_list":["post-42098","post","type-post","status-publish","format-standard","hentry","category-news","tag-6126","tag-4061","tag-1893"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/42098","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=42098"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/42098\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=42098"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=42098"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=42098"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=42098"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}