Podcasting tool: Microsoft open-sources the VibeVoice-1.5B audio model, which supports Chinese and can generate 90-minute conversations with four speakers

August 27, 2025 - Technology media outlet MarkTechPost published a blog post on August 25 reporting that Microsoft has released VibeVoice-1.5B, an open-source text-to-speech (TTS) model. It can generate up to 90 minutes of natural speech from up to four different speakers in a single pass, and supports cross-lingual and singing synthesis.

In terms of architecture, VibeVoice-1.5B is built on the 1.5B-parameter Qwen2.5 language model, combined with an acoustic tokenizer and a semantic tokenizer that both operate at a low frame rate of 7.5 Hz.

The acoustic tokenizer uses a σ-VAE structure to downsample 24 kHz raw audio by a factor of 3,200, while the semantic tokenizer is trained on an automatic speech recognition (ASR) proxy task to preserve dialog semantics. On the decoding side, a diffusion decoder with about 123 million parameters is combined with classifier-free guidance and DPM-Solver to improve sound quality and detail.
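The 7.5 Hz frame rate follows directly from the two figures quoted above: 24 kHz audio reduced by a factor of 3,200. The following is a minimal arithmetic sketch, not code from the model itself; the function names are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope check of the tokenizer figures quoted in the article.
# Function names are hypothetical; nothing here comes from the VibeVoice code.

SAMPLE_RATE_HZ = 24_000     # raw audio sample rate reported in the article
COMPRESSION_FACTOR = 3_200  # tokenizer downsampling factor reported in the article


def frame_rate_hz(sample_rate_hz: int, compression_factor: int) -> float:
    """Latent frames produced per second of audio after downsampling."""
    return sample_rate_hz / compression_factor


def frame_duration_ms(sample_rate_hz: int, compression_factor: int) -> float:
    """Span of audio, in milliseconds, covered by one latent frame."""
    return 1000.0 * compression_factor / sample_rate_hz


rate = frame_rate_hz(SAMPLE_RATE_HZ, COMPRESSION_FACTOR)
duration = frame_duration_ms(SAMPLE_RATE_HZ, COMPRESSION_FACTOR)
print(f"{rate} frames per second, about {duration:.1f} ms of audio per frame")
```

Each frame therefore stands in for 3,200 audio samples, which is what keeps very long recordings short in terms of sequence length.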

During training, the context length is gradually expanded from 4k to 65k tokens to ensure speech coherence and speaker consistency in long conversations. The architecture supports multi-speaker turn-taking to simulate natural conversation, and it can generate long audio in streaming mode, laying the foundation for future real-time TTS.
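A rough calculation shows why a 65k-token context is in the right range for 90 minutes of audio at 7.5 Hz. This sketch rests on two assumptions that the article does not state: that each speech frame occupies one position in the context, and that "65k" means 65,536. The helper name is hypothetical.

```python
# Rough context-budget estimate for long-form generation.
# Assumptions (not from the article): one context position per speech frame,
# and "65k" interpreted as 65,536 tokens.

FRAME_RATE_HZ = 7.5          # frame rate reported in the article
MAX_CONTEXT_TOKENS = 65_536  # assumed reading of "65k"


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed to cover the given duration."""
    return round(minutes * 60 * frame_rate_hz)


frames_90_min = speech_frames(90)
remaining = MAX_CONTEXT_TOKENS - frames_90_min
print(f"90 minutes -> {frames_90_min} frames, {remaining} positions left over")
```

Under these assumptions, 90 minutes of speech needs 40,500 frames, leaving roughly 25,000 positions for the text script and any other conditioning input.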

VibeVoice-1.5B also has limitations. It currently supports only English and Chinese, and output in other languages may be inaccurate or inappropriate. It does not support overlapping speech between speakers and cannot generate background sound effects or music. Microsoft explicitly prohibits using the model for voice impersonation, disinformation, or bypassing authentication, and reminds users to comply with the law and to disclose that content is AI-generated.

Microsoft says the model is aimed at the research and developer community and is suitable for podcast production, conversational AI, speech content generation, and other fields. A larger 7B-parameter version is planned for future release to support low-latency interaction and higher-fidelity real-time synthesis, further expanding the application scenarios.
