December 25 news: Alibaba's Tongyi team has added two new AI models to the Qwen3-TTS family: the voice design model Qwen3-TTS-VD-Flash and the voice cloning model Qwen3-TTS-VC-Flash. 1AI lists the main features of the models below:
- Voice design: Qwen3-TTS-VD-Flash accepts complex natural language instructions and offers fine-grained control over timbre, rhythm, emotion, persona, and more. It gives full control from “what to say” to “how to say it”, so users can freely define the voice they want instead of only cloning an existing voice or choosing from a fixed set of preset voices. Its overall performance is significantly better than that of GPT-4o-mini-tts and Mimo-audio-7b-instruct, and it surpasses Gemini-2.5-pro-preview-tts in role-play tests.
- Voice cloning: Qwen3-TTS-VC-Flash supports voice cloning from about 3 seconds of audio and can generate speech from the cloned voice in major languages including Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian. On the MiniMax TTS Multilingual Test Set, its average word error rate (WER) is better overall than that of MiniMax, ElevenLabs, and GPT-4o-Audio-Preview.
- High expressiveness: Qwen3-TTS-VD-Flash and Qwen3-TTS-VC-Flash have highly expressive, human-like voices. They stably and reliably output speech that follows the input text, and they automatically adjust tone and rhythm so that the delivery sounds natural and lively.
- Robust text parsing: Qwen3-TTS-VD-Flash and Qwen3-TTS-VC-Flash have strong text parsing capabilities. They automatically handle complex text structures, extract key information accurately, and remain robust across diverse, non-standard text formats. (Note: robustness is a system's ability to keep functioning stably when its internal structure or external environment changes.)
Qwen3-TTS-VD-Flash
Qwen3-TTS supports generating customized voice personas from natural language descriptions. Users can freely enter acoustic attributes, persona descriptions, background information, and similar details to create the voice they want.
Controllable generation: Qwen3-TTS's overall performance is significantly better than that of GPT-4o-mini-tts and Mimo-audio-7b-instruct, and it surpasses Gemini-2.5-pro-preview-tts in role-play tests.

Qwen3-TTS-VC-Flash
Qwen3-TTS supports voice cloning from about 3 seconds of audio and can generate multilingual speech from the cloned voice, with strong robustness to complex text and in-the-wild audio.
Multilingual voice cloning: Qwen3-TTS produces more stable content in Chinese, English, French, Italian, and other languages than MiniMax, ElevenLabs, and GPT-4o-Audio-Preview, with the best (lowest) average word error rate (WER) among them.
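
For readers unfamiliar with the metric, the sketch below shows the standard textbook definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. Lower is better. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring pipeline used in the benchmark above, and languages written without spaces (such as Chinese or Japanese) need a different tokenizer or a character-level variant.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: whitespace-separated words
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In a TTS evaluation, the synthesized audio is typically transcribed by a speech recognizer and the transcript is compared with the input text in this way, so a lower WER indicates that the generated speech follows the text more faithfully.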

Qwen3-TTS-Voice-Design API documentation:
https://www.alibabacloud.com/help/zh/model-studio/qwen-tts-voice-design?spm=a2ty_o06.30285417.0.0.56a0c9216Ey6VM
Qwen3-TTS-Voice-Clone API documentation:
https://www.alibabacloud.com/help/zh/model-studio/qwen-tts-voice-cloning?spm=a2ty_o06.30285417.0.0.56a0c921WnHNlN