August 27, 2025 - Technology media outlet marktechpost published a blog post on August 25 reporting that Microsoft has released VibeVoice-1.5B, an open-source text-to-speech (TTS) model. It can generate up to 90 minutes of natural speech with up to 4 different speakers in a single session, and it supports cross-lingual and singing synthesis.

In terms of architecture, VibeVoice-1.5B is built on the 1.5B-parameter Qwen2.5 language model, combined with an acoustic tokenizer and a semantic tokenizer that operate at a low frame rate of 7.5 Hz.
The acoustic tokenizer uses a σ-VAE structure to compress 24 kHz raw audio by a factor of 3200, while the semantic tokenizer is trained on a speech recognition (ASR) proxy task to preserve dialogue semantics. On the decoding side, a diffusion decoder with about 123 million parameters is combined with classifier-free guidance and DPM-Solver to improve sound quality and detail.
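The 7.5 Hz frame rate and the 3200x compression figure are two views of the same number, which a quick calculation makes visible. The sketch below is only illustrative arithmetic; the function name `frame_rate_hz` is a hypothetical helper, not part of any VibeVoice code.

```python
def frame_rate_hz(sample_rate_hz: float, compression_ratio: float) -> float:
    """Frames per second left after downsampling raw audio by compression_ratio.

    Hypothetical helper for illustration; not taken from the VibeVoice codebase.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return sample_rate_hz / compression_ratio


if __name__ == "__main__":
    # Figures quoted in the article: 24 kHz audio, 3200x compression.
    print(frame_rate_hz(24_000, 3_200))  # 7.5
```

In other words, each acoustic frame stands for 3200 audio samples, or roughly 133 ms of sound.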
During training, the context length is gradually expanded from 4k to 65k tokens to ensure speech coherence and speaker consistency over long conversations. The architecture supports multi-speaker turn-taking to simulate natural conversation, and it can generate long audio in streaming mode, laying the foundation for future real-time TTS.
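A rough budget check shows why a low frame rate and a long context window go together. The sketch below assumes that each 7.5 Hz frame occupies one position in the context and that "65k" means 65,536 tokens. Neither assumption is stated in the article, and text tokens would also take up part of the window, so treat the result as an order-of-magnitude estimate. The function names are hypothetical.

```python
FRAME_RATE_HZ = 7.5        # frame rate quoted in the article
CONTEXT_TOKENS = 65_536    # assumption: "65k" taken to mean 2**16


def frames_for_minutes(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed to cover the given duration."""
    return int(minutes * 60 * frame_rate_hz)


def fits_in_context(minutes: float, context_tokens: int = CONTEXT_TOKENS) -> bool:
    """True if the speech frames alone fit in the context window.

    Assumes one context position per frame; text tokens are ignored.
    """
    return frames_for_minutes(minutes) <= context_tokens


if __name__ == "__main__":
    print(frames_for_minutes(90))  # 40500
    print(fits_in_context(90))     # True
```

Under these assumptions, 90 minutes of speech needs about 40,500 frames, which leaves room in a 65k window for the accompanying script text.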
VibeVoice-1.5B also has limitations. It currently supports only English and Chinese, and input in other languages may produce inaccurate or inappropriate output. It does not support overlapping speech between speakers, and it cannot generate background sound effects or music. Microsoft explicitly prohibits using the model for voice impersonation, disinformation, or bypassing authentication, and reminds users to comply with the law and to disclose when content is AI-generated.
Microsoft says the model is aimed at the research and developer community and is suited to podcast production, conversational AI, speech content generation, and similar fields. A larger 7B-parameter version is planned for future release to support low-latency interaction and higher-fidelity real-time synthesis, further expanding the range of applications.