{"id":41840,"date":"2025-08-27T12:25:07","date_gmt":"2025-08-27T04:25:07","guid":{"rendered":"https:\/\/www.1ai.net\/?p=41840"},"modified":"2025-08-27T12:25:07","modified_gmt":"2025-08-27T04:25:07","slug":"%e6%92%ad%e5%ae%a2%e7%a5%9e%e5%99%a8%ef%bc%9a%e5%be%ae%e8%bd%af%e5%bc%80%e6%ba%90-vibevoice-1-5b-%e9%9f%b3%e9%a2%91%e6%a8%a1%e5%9e%8b%ef%bc%8c%e6%94%af%e6%8c%81%e4%b8%ad%e6%96%87%e3%80%81%e5%8f%af","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/41840.html","title":{"rendered":"Podcasting tool: Microsoft open-sources the VibeVoice-1.5B audio model, which supports Chinese and can generate 90 minutes of four-speaker conversational speech"},"content":{"rendered":"<p>August 27, 2025 - Technology media outlet MarkTechPost published a blog post on August 25 reporting that <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%be%ae%e8%bd%af\" title=\"[View articles tagged with [Microsoft]]\" target=\"_blank\" >Microsoft<\/a> has released the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open-source<\/a> text-to-speech (TTS) model VibeVoice-1.5B. <strong>It can generate up to 90 minutes of natural speech with up to 4 distinct speakers in a single pass, and supports cross-lingual and singing synthesis.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-41841\" title=\"520a0014j00t1mxkp0062d000v900fkp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/08\/520a0014j00t1mxkp0062d000v900fkp.jpg\" alt=\"520a0014j00t1mxkp0062d000v900fkp\" width=\"1125\" height=\"560\" \/><\/p>\n<p>In terms of architecture, VibeVoice-1.5B is built on the 1.5B-parameter Qwen2.5 language model, combined with an acoustic tokenizer and a semantic tokenizer that operate at a low frame rate of 7.5Hz.<\/p>\n<p>The acoustic tokenizer uses a \u03c3-VAE structure to compress 24kHz raw audio by a factor of 3200, while the semantic tokenizer is trained on a speech recognition (ASR) proxy task to preserve dialogue semantics. 
The decoding side uses a 123-million-parameter diffusion decoder combined with classifier-free guidance and DPM-Solver to improve sound quality and detail.<\/p>\n<p>During training, the context length is gradually expanded from 4k to 65k tokens to ensure speech coherence and speaker consistency in long conversations. The architecture supports multi-speaker turn-taking to simulate natural conversation, and it can generate long audio in streaming mode, laying the foundation for future real-time TTS.<\/p>\n<p>VibeVoice-1.5B also has limitations: it currently supports only English and Chinese, and other languages may produce inaccurate or inappropriate output; it does not support overlapping speech between speakers; and it cannot generate background sound effects or music. Microsoft explicitly prohibits using the model for voice impersonation, disinformation, or bypassing authentication, and reminds users to comply with the law and to disclose that content is AI-generated.<\/p>\n<p>Microsoft says the model is aimed at the research and developer community and is suitable for <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%92%ad%e5%ae%a2\" title=\"[Sees articles with [crowding] labels]\" target=\"_blank\" >podcast<\/a> production, conversational AI, speech content generation, and other fields. 
In the future, a larger 7B-parameter version will be released to support low-latency interaction and higher-fidelity real-time synthesis, further expanding the application scenarios.<\/p>\n<p><strong>1AI reference links<\/strong><\/p>\n<ul>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"0f2e\"><a href=\"https:\/\/github.com\/microsoft\/VibeVoice\/blob\/main\/report\/TechnicalReport.pdf\" target=\"_blank\" rel=\"noopener\">Microsoft VibeVoice-1.5B Technical Report<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"e83a\"><a href=\"https:\/\/huggingface.co\/microsoft\/VibeVoice-1.5B\" target=\"_blank\" rel=\"noopener\">Hugging Face<\/a><\/p>\n<\/li>\n<li class=\"list-undefined list-reference-paddingleft\">\n<p data-vmark=\"c067\"><a href=\"https:\/\/github.com\/microsoft\/VibeVoice\" target=\"_blank\" rel=\"noopener\">GitHub<\/a><\/p>\n<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>August 27: technology media outlet MarkTechPost published a blog post on August 25 reporting that Microsoft has released the open-source text-to-speech (TTS) model VibeVoice-1.5B, which can generate up to 90 minutes of natural speech with up to four distinct speakers in a single pass and supports cross-lingual and singing synthesis. Architecturally, VibeVoice-1.5B is built on the 1.5B-parameter Qwen2.5 language model and combines an acoustic tokenizer and a semantic tokenizer operating at a low 7.5Hz frame rate. 
The acoustic tokenizer uses a \u03c3-VAE structure to compress the 24kHz raw audio by a factor of 3200.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[219,280,6928,6603],"collection":[],"class_list":["post-41840","post","type-post","status-publish","format-standard","hentry","category-news","tag-219","tag-280","tag-6928","tag-6603"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41840","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=41840"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/41840\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=41840"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=41840"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=41840"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=41840"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}