Alibaba releases and open-sources China's first "hybrid reasoning model" Qwen3: two thinking modes, roughly 36 trillion pre-training tokens, 119 languages and dialects

April 29 - Early this morning, Alibaba released Qwen3, the new generation of its Tongyi Qianwen model family, billed as one of the strongest open-source models in the world.

| Models | Layers | Heads (Q / KV) | Tie Embedding | Context Length |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 64 | 64 / 8 | No | 128K |

| Models | Layers | Heads (Q / KV) | Experts (Total / Activated) | Context Length |
|---|---|---|---|---|
| Qwen3-30B-A3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 94 | 64 / 4 | 128 / 8 | 128K |
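To check these architecture details against the released checkpoints, one can read them from the published configurations with Hugging Face transformers. The sketch below is minimal and assumes the checkpoints live under the Qwen organization on the Hub and that the configs use the usual transformers field names (num_hidden_layers, num_attention_heads, num_key_value_heads, tie_word_embeddings, max_position_embeddings) plus MoE-specific fields (num_experts, num_experts_per_tok); verify against config.to_dict() if a field differs.

```python
# Minimal sketch: inspect Qwen3 architecture fields via Hugging Face transformers.
# Field names follow common transformers conventions and are assumptions here.
from transformers import AutoConfig

for repo in ("Qwen/Qwen3-0.6B", "Qwen/Qwen3-30B-A3B"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo)
    print("  layers:", cfg.num_hidden_layers)
    print("  heads (Q / KV):", cfg.num_attention_heads, "/", cfg.num_key_value_heads)
    print("  tie embedding:", getattr(cfg, "tie_word_embeddings", None))
    print("  context length:", cfg.max_position_embeddings)
    # MoE checkpoints additionally expose expert counts (assumed field names).
    if hasattr(cfg, "num_experts"):
        print("  experts (total / activated):", cfg.num_experts, "/", cfg.num_experts_per_tok)
```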

Qwen3 is China's first "hybrid reasoning model": it integrates "fast thinking" and "slow thinking" into the same model, which greatly reduces compute consumption.

The post-trained models, such as Qwen3-30B-A3B, and their pre-trained base counterparts (e.g., Qwen3-30B-A3B-Base) are openly available on major platforms. Aliyun has open-sourced the weights of two MoE models:

  • Qwen3-235B-A22B: a large MoE model with about 235 billion total parameters and about 22 billion activated parameters.
  • Qwen3-30B-A3B: a small MoE model with about 30 billion total parameters and about 3 billion activated parameters.

In addition, six Dense models have been open-sourced: Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B, all under the Apache 2.0 license.

According to Aliyun, its flagship model Qwen3-235B-A22B demonstrates highly competitive results against top models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro in benchmarks of coding, math, and general capabilities.

In addition, the small MoE model Qwen3-30B-A3B activates only about 10% as many parameters as QwQ-32B yet delivers better performance, and even a small model like Qwen3-4B can match the performance of Qwen2.5-72B-Instruct.
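For a quick sense of where that roughly-10% figure comes from (activated parameter counts as quoted in this article; QwQ-32B is a dense model, so all of its roughly 32 billion parameters are active on each forward pass):

```python
# Activated parameters per forward pass, in billions (figures quoted in the article).
qwen3_30b_a3b_activated = 3    # MoE: ~3B of ~30B total parameters are activated
qwq_32b_activated = 32         # dense: all ~32B parameters are active

ratio = qwen3_30b_a3b_activated / qwq_32b_activated
print(f"{ratio:.1%}")          # ~9.4%, i.e., roughly a tenth of QwQ-32B's activated parameters
```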


Core Highlights

Multiple modes of thinking

The Qwen3 model supports two modes of thinking:

  • Thinking mode: In this mode, the model reasons step by step and gives a final answer after careful deliberation. This approach is ideal for complex problems that require deep thinking.
  • Non-thinking mode: In this mode, the model provides fast, near-instantaneous responses, suitable for simple problems where speed matters more than depth.

This flexibility allows the user to control the extent to which the model "thinks" according to the specific task. For example, complex problems can be solved with extended reasoning steps, while simple problems can be answered directly and quickly without delay.

Critically, the combination of these two modes greatly enhances the model's ability to achieve stable and efficient "thinking budget" control. As mentioned above, Qwen3 exhibits scalable and smooth performance improvements that are directly related to the allocated computational reasoning budget. This design makes it easier for users to configure specific budgets for different tasks, achieving a better balance between cost-effectiveness and reasoning quality.
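As an illustration of how this mode switching is exposed to developers, the sketch below toggles the enable_thinking flag through the Hugging Face transformers chat template, following the interface described in the Qwen3 model cards; the Qwen/Qwen3-0.6B checkpoint, loading options, and generation budget are illustrative choices, not prescriptions.

```python
# Sketch: switching Qwen3 between thinking and non-thinking modes
# via the chat template's enable_thinking flag.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # smallest open checkpoint; any Qwen3 repo works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are there below 100?"}]

for thinking in (True, False):
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,   # True: slow, step-by-step; False: fast, direct answer
    )
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    reply = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    print(f"--- enable_thinking={thinking} ---\n{reply}\n")
```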

Multilingual support

Qwen3 models support 119 languages and dialects, such as Simplified Chinese, Traditional Chinese, and Cantonese. This extensive multilingual capability opens up new possibilities for international applications, allowing users worldwide to benefit from the power of these models.

Pre-training

In terms of pre-training, the Qwen3 dataset has been significantly expanded compared to Qwen2.5: Qwen2.5 was pre-trained on 18 trillion tokens, while Qwen3 uses nearly twice as much data, approximately 36 trillion tokens, covering 119 languages and dialects.

To build this huge dataset, Aliyun not only collected data from the web but also mined PDF documents, extracting text from them with Qwen2.5-VL and then improving the quality of the extracted content with Qwen2.5.

To increase the amount of math and code data, Aliyun also synthesized data with Qwen2.5-Math and Qwen2.5-Coder, two expert models for math and code, generating content in a variety of forms including textbooks, question-answer pairs, and code snippets.

According to Aliyun, the Qwen3 pre-training process is divided into three phases:

  • In the first stage (S1), the model was pre-trained on more than 30 trillion tokens with a context length of 4K tokens. This stage gave the model basic language skills and general knowledge.
  • In the second stage (S2), the dataset was improved by increasing the proportion of knowledge-intensive data (e.g., STEM, programming, and reasoning tasks), and the model was then pre-trained on an additional 5 trillion tokens.
  • In the final stage, the context length was extended to 32K tokens using high-quality long-context data, ensuring that the model can handle longer inputs effectively.
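The staged schedule can be restated compactly as data; this is a purely illustrative summary of the figures quoted above (the S2 context length is not stated in the article), not an official training configuration:

```python
# Illustrative restatement of the three pre-training stages described above.
PRETRAIN_STAGES = [
    {"stage": "S1", "data": "30T+ tokens, web-scale",         "context": "4K"},
    {"stage": "S2", "data": "5T tokens, knowledge-intensive",  "context": "(not stated)"},
    {"stage": "S3", "data": "high-quality long-context data",  "context": "32K"},
]
for s in PRETRAIN_STAGES:
    print(s["stage"], "|", s["data"], "| context:", s["context"])
```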

Thanks to improvements in model architecture, more training data, and more efficient training methods, the overall performance of the Qwen3 Dense base models is comparable to that of Qwen2.5 base models with far more parameters. For example, Qwen3-1.7B / 4B / 8B / 14B / 32B-Base performs comparably to Qwen2.5-3B / 7B / 14B / 32B / 72B-Base, respectively.

In particular, the Qwen3 Dense base models outperform even larger Qwen2.5 models in areas such as STEM, coding, and reasoning. The Qwen3 MoE base models achieve performance similar to the Qwen2.5 Dense base models while using only about 10% of the activated parameters, yielding significant savings in training and inference costs.

Post-training

In order to develop hybrid models capable of both thoughtful reasoning and rapid response, Aliyun implemented a four-phase post-training pipeline consisting of:

  • (1) Long chain-of-thought cold start.
  • (2) Reinforcement learning on long chains of thought.
  • (3) Thinking-mode fusion.
  • (4) General reinforcement learning.

In the first phase, Aliyun fine-tuned the model on diverse long chain-of-thought data covering a wide range of tasks and domains such as math, code, logical reasoning, and STEM problems. This process was designed to equip the model with basic reasoning capabilities. The second phase focused on large-scale reinforcement learning, using rule-based rewards to enhance the model's exploration and exploitation capabilities.

In the third phase, Aliyun fine-tuned the model on a combination of long chain-of-thought data and common instruction-tuning data to fold the non-thinking mode into the thinking model, ensuring a seamless integration of reasoning and rapid-response capabilities.

Finally, in the fourth phase, Aliyun applied reinforcement learning on tasks across more than 20 general-purpose domains, including instruction following, format following, and agent capabilities, to further strengthen the model's general abilities and correct undesirable behaviors.

Advanced Usage

Aliyun also provides a soft-switch mechanism that lets deployment users dynamically control the model's behavior when enable_thinking=True: adding /think or /no_think to user prompts or system messages switches the model's thinking mode turn by turn. In a multi-turn dialog, the model follows the most recent instruction.
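Below is a minimal sketch of that turn-by-turn soft switch using the same transformers chat-template interface; the /think and /no_think tags and enable_thinking=True come from the passage above, while the chat helper, checkpoint choice, and generation settings are illustrative assumptions.

```python
# Sketch: /think and /no_think soft switches in a multi-turn chat with Qwen3.
# The template is built with enable_thinking=True so the in-prompt tags take effect.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # illustrative choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def chat(history, user_msg, max_new_tokens=512):
    """Append a user turn, generate a reply, and keep it in the running history."""
    history.append({"role": "user", "content": user_msg})
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    reply = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
print(chat(history, "Solve 24 * 17 step by step. /think"))   # this turn: slow, deliberate reasoning
print(chat(history, "Now just give me 36 * 11. /no_think"))  # this turn: fast, direct answer
```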

1AI Reminder: You can try out Qwen3 on the Qwen Chat web version (chat.qwen.ai) and in the Tongyi App.
