Step-Audio releases Step-Audio 2 mini, a large end-to-end voice model, with SOTA scores in multiple benchmarks

September 1 News.Step StarOpen Source Released Todayend-to-endVoice big model Step-Audio 2 mini, a model that has achieved SOTA scores on several international benchmark test sets, is now available on the Step-Star Open Platform.

Step-Audio releases Step-Audio 2 mini, a large end-to-end voice model, with SOTA scores in multiple benchmarks

1AI learns from the official introduction that it models speech understanding, audio reasoning and generation in a unified way, and is the first to support voice-native Tool Calling capabilities.Enables network search and other operations.

Step-Audio 2 mini achieves SOTA scores in several key benchmarks, excelling in audio understanding, speech recognition, translation and dialog scenarios.Outperforms all open source end-to-end speech models including Qwen-Omni and Kimi-Audio in comprehensive performanceand outperforms GPT-4o Audio on most tasks.

  • Step-Audio 2 mini topped the list of open source end-to-end speech models with a score of 73.2 on MMAU, a generalized multimodal audio understanding test set;
  • On the URO Bench, which measures spoken conversation ability, Step-Audio 2 mini received the highest scores for open source end-to-end speech modeling in both the basic and professional tracks, demonstrating excellent conversation comprehension and expression;
  • On the Chinese-English translation task, Step-Audio 2 mini has a clear advantage, scoring 39.3 and 29.1 on the CoVoST 2 and CVSS evaluation sets, respectively, and is significantly ahead of GPT-4o Audio and other open-source speech models;
  • In the speech recognition task, Step-Audio 2 mini achieves the first place in multi-language and multi-dialect. The average CER (Character Error Rate) of the open-source Chinese test set is 3.19, and the average WER (Word Error Rate) of the open-source English test set is 3.50, which is more than 15% ahead of other open-source models.

In the past, AI voice was often criticized for its low IQ and emotional intelligence. First, it is "no knowledge", lacking the knowledge reserve and reasoning ability of a large text model; second, it is "cold", not understanding the subtext, tone of voice, emotion, laughter, and other "string sounds". "Step-Audio 2 mini effectively solves the problems of previous speech models through innovative architectural design.

  • True end-to-end multimodal architectureStep-Audio 2 mini breaks through the traditional ASR+LLM+TTS three-stage structure and realizes the direct conversion from raw audio input to voice response output, with a simpler architecture, lower latency, and effective understanding of paralinguistic information and non-vocal signals.
  • CoT Reasoning combined with intensive learning:Step-Audio 2 mini introduces for the first time Chain-of-Thought (CoT) reasoning and reinforcement learning co-optimization in an end-to-end speech model, which enables fine-grained understanding, reasoning and natural response to paralinguistic and non-speech signals such as emotion, intonation, music, and more.
  • Audio knowledge enhancement:The model support includes external tools such as web retrieval, which helps the model to solve the illusion problem and empowers the model on multi-scene extensions.

GitHub: https://github.com/stepfun-ai/Step-Audio2

Hugging Face: https://huggingface.co/stepfun-ai/Step-Audio-2-mini

ModelScope: https://www.modelscope.cn/models/stepfun-ai/Step-Audio-2-mini

statement:The content of the source of public various media platforms, if the inclusion of the content violates your rights and interests, please contact the mailbox, this site will be the first time to deal with.
Information

"Stargate" AI Infrastructure Next Stop: OpenAI Plans GW-Scale Mega Data Center in India, Sources Say

2025-9-1 16:53:37

Information

Jitterbug Upgrades AI Content Labeling Feature, Trial and Effective Today

2025-9-1 16:57:18

Search