{"id":15198,"date":"2024-07-09T13:30:34","date_gmt":"2024-07-09T05:30:34","guid":{"rendered":"https:\/\/www.1ai.net\/?p=15198"},"modified":"2024-07-09T13:30:34","modified_gmt":"2024-07-09T05:30:34","slug":"%e9%98%bf%e9%87%8c%e4%ba%91%e9%80%9a%e4%b9%89%e5%8d%83%e9%97%ae%e5%bc%80%e6%ba%90%e4%b8%a4%e6%ac%be%e8%af%ad%e9%9f%b3%e5%9f%ba%e5%ba%a7%e6%a8%a1%e5%9e%8b%ef%bc%8c%e8%af%86%e5%88%ab%e6%95%88%e6%9e%9c","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/15198.html","title":{"rendered":"Alibaba Cloud Tongyi Qianwen open-sources two voice base models, with better recognition performance than OpenAI Whisper"},"content":{"rendered":"<p data-vmark=\"d69d\"><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%98%bf%e9%87%8c%e4%ba%91\" title=\"_Other Organiser\" target=\"_blank\" >Alibaba Cloud<\/a>'s <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%80%9a%e4%b9%89%e5%8d%83%e9%97%ae\" title=\"[View articles tagged with [Tongyi Thousand Questions]]\" target=\"_blank\" >Tongyi Qianwen<\/a> has <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open-sourced<\/a> two <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%af%ad%e9%9f%b3%e5%9f%ba%e5%ba%a7%e6%a8%a1%e5%9e%8b\" title=\"[Sees articles with tags]\" target=\"_blank\" >voice base models<\/a>: SenseVoice (for speech recognition) and CosyVoice (for speech generation).<\/p>\n<p data-vmark=\"c614\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15199\" title=\"ca00bc4c-bedd-48df-97ad-aa3d4061ac6a\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/ca00bc4c-bedd-48df-97ad-aa3d4061ac6a.png\" alt=\"ca00bc4c-bedd-48df-97ad-aa3d4061ac6a\" width=\"1440\" height=\"699\" \/><\/p>\n<p data-vmark=\"3c14\">SenseVoice focuses on <strong>high-precision multilingual speech recognition, emotion recognition, and audio event detection<\/strong>, and has the following characteristics:<\/p>\n<ul class=\"medium-size 
list-paddingleft-2\">\n<li>\n<p data-vmark=\"ba92\"><strong>Multilingual recognition<\/strong>: Trained on more than 400,000 hours of data and supporting more than 50 languages, with <strong>recognition performance better than the Whisper model<\/strong><\/p>\n<\/li>\n<li>\n<p data-vmark=\"b49c\"><strong>Rich text recognition<\/strong>: It offers strong emotion recognition, <strong>matching or exceeding the best emotion recognition models<\/strong> on test data; it also supports sound event detection, covering common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing<\/p>\n<\/li>\n<li>\n<p data-vmark=\"18c5\"><strong>Efficient inference<\/strong>: The SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low inference latency. Inference on 10 s of audio takes only 70 ms, <strong>15 times faster than Whisper-Large<\/strong><\/p>\n<\/li>\n<li>\n<p data-vmark=\"c90b\"><strong>Fine-tuning customization<\/strong>: Convenient fine-tuning scripts and strategies help users fix long-tail sample problems in their own business scenarios<\/p>\n<\/li>\n<li>\n<p data-vmark=\"6661\"><strong>Service deployment<\/strong>: It provides a complete service deployment pipeline, supports multiple concurrent requests, and offers clients in languages such as Python, C++, HTML, Java, and C#<\/p>\n<\/li>\n<\/ul>\n<p data-vmark=\"d465\">Compared with open source emotion recognition models, the SenseVoice-Large model <strong>achieves the best results on almost all datasets<\/strong>, and the SenseVoice-Small model also outperforms other open source models on most datasets.<\/p>\n<p data-vmark=\"bddc\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-15200\" title=\"7b64560f-2477-44d0-a048-362867595de2\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/7b64560f-2477-44d0-a048-362867595de2.png\" 
alt=\"7b64560f-2477-44d0-a048-362867595de2\" width=\"999\" height=\"577\" \/><\/p>\n<p data-vmark=\"2128\">The CosyVoice model supports multiple languages as well as timbre and emotion control. It performs well in multilingual speech generation, zero-shot speech generation, cross-lingual voice cloning, and instruction following.<\/p>\n<p data-vmark=\"e224\">Related Links:<\/p>\n<p data-vmark=\"e1b8\">SenseVoice: <a title=\"https:\/\/github.com\/FunAudioLLM\/SenseVoice\" href=\"https:\/\/github.com\/FunAudioLLM\/SenseVoice\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/github.com\/FunAudioLLM\/SenseVoice<\/span><\/a><\/p>\n<p data-vmark=\"c2fc\">CosyVoice: <a title=\"https:\/\/github.com\/FunAudioLLM\/CosyVoice\" href=\"https:\/\/github.com\/FunAudioLLM\/CosyVoice\" target=\"_blank\" rel=\"noopener\"><span class=\"link-text-start-with-http\">https:\/\/github.com\/FunAudioLLM\/CosyVoice<\/span><\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Alibaba Cloud's Tongyi Qianwen has open-sourced two voice base models, SenseVoice (for speech recognition) and CosyVoice (for speech generation). 
SenseVoice focuses on high-precision multilingual speech recognition, emotion recognition, and audio event detection, with the following features: Multilingual recognition: trained on more than 400,000 hours of data, supports more than 50 languages, and delivers better recognition performance than the Whisper model. Rich text recognition: offers strong emotion recognition, matching or exceeding the current best emotion recognition models on test data; supports sound event detection, covering common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing. Efficient inference: SenseVoice<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[219,3415,331,334],"collection":[],"class_list":["post-15198","post","type-post","status-publish","format-standard","hentry","category-news","tag-219","tag-3415","tag-331","tag-334"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15198","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=15198"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/15198\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=15198"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=15198"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=15198"},{"taxonomy":"collection","embeddable
":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=15198"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}