{"id":16400,"date":"2024-07-25T08:47:40","date_gmt":"2024-07-25T00:47:40","guid":{"rendered":"https:\/\/www.1ai.net\/?p=16400"},"modified":"2024-07-25T08:47:40","modified_gmt":"2024-07-25T00:47:40","slug":"%e9%85%8d%e9%9f%b3%e5%91%98%e5%8d%b1%ef%bc%81%e5%be%ae%e8%bd%afvall-e-2%e6%a8%a1%e5%9e%8b%e8%af%ad%e9%9f%b3%e5%85%8b%e9%9a%86%e8%be%be%e5%88%b0%e9%85%8d%e9%9f%b3%e5%91%98%e6%b0%b4%e5%87%86","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/16400.html","title":{"rendered":"Voice actors in danger! Microsoft&#039;s VALL-E 2 model&#039;s voice cloning reaches voice-actor level"},"content":{"rendered":"<p data-pm-slice=\"0 0 []\">Recently, <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%be%ae%e8%bd%af\" title=\"[View articles tagged with [Microsoft]]\" target=\"_blank\" >Microsoft<\/a> published its zero-shot text-to-speech (<a href=\"https:\/\/www.1ai.net\/en\/tag\/tts\" title=\"_OTHER ORGANISER\" target=\"_blank\" >TTS<\/a>) model VALL-E 2, which has attracted widespread attention in the technology community. 
This breakthrough achieves human-parity speech synthesis for the first time and is considered a milestone in the TTS field.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16401\" title=\"get-801\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/get-801.jpg\" alt=\"get-801\" width=\"793\" height=\"580\" \/><\/div>\n<p data-track=\"53\"><strong>Technical highlights and innovations:<\/strong><\/p>\n<p data-track=\"54\">Zero-shot learning: VALL-E 2 needs only a short sample of an unfamiliar voice to speak any text in that same voice, demonstrating remarkable instant imitation capability.<\/p>\n<p data-track=\"55\">Repetition-aware sampling: an improved random sampling method that effectively alleviates the infinite-loop problem and improves decoding stability.<\/p>\n<p data-track=\"56\">Grouped code modeling: grouping codec codes shortens the sequence length, speeding up inference while also improving performance.<\/p>\n<p data-track=\"57\">Simplified training data requirements: VALL-E 2 needs only simple speech-transcription pairs for training, which greatly simplifies data collection and processing.<\/p>\n<p data-track=\"58\">Performance evaluation: on both subjective scores (SMOS and CMOS) and objective metrics (SIM, WER and DNSMOS), VALL-E 2 not only surpasses its predecessor VALL-E but even outperforms real human speech in some respects.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16402\" title=\"get-802\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/get-802.jpg\" alt=\"get-802\" width=\"806\" height=\"610\" \/><\/div>\n<p data-track=\"59\"><strong>Ethical considerations and market response:<\/strong><\/p>\n<p data-track=\"60\">Potential risks: VALL-E 2&#039;s powerful voice imitation capability has raised concerns about deepfake abuse.<\/p>\n<p data-track=\"61\">Microsoft is cautious on this point and currently positions VALL-E 2 as a pure research project, with no plans for productization. It has published an ethics statement on the project page and in the paper, emphasizing the necessity of synthetic-speech detection and speaker-authorization mechanisms.<\/p>\n<p data-track=\"62\">Some users expressed disappointment that Microsoft did not release a trial product; industry observers speculate that Microsoft is avoiding potential risks and negative publicity. As the technology matures and market competition intensifies, the commercialization of VALL-E 2 or similar technologies may be only a matter of time.<\/p>\n<p data-track=\"63\"><strong>Technical limitations and room for improvement:<\/strong><\/p>\n<p data-track=\"64\">Demo limitations: the publicly released demonstration samples are limited, making it difficult to fully evaluate the model&#039;s performance.<\/p>\n<p data-track=\"65\">Accent adaptability: the model&#039;s handling of accents other than British and American English needs improvement.<\/p>\n<p data-track=\"66\">Computational efficiency: despite the improvements, inference speed still has room for optimization.<\/p>\n<p data-track=\"67\">The emergence of VALL-E 2 marks a new era for zero-shot TTS. It not only demonstrates the great potential of AI in speech synthesis, but also prompts deeper reflection on ethics and the responsible use of technology. As the technology matures, we can expect more innovative applications, but industry, regulators, and the public must work together to ensure this powerful technology is used responsibly. 
In the future, VALL-E 2 and similar technologies are likely to bring revolutionary changes to voice assistants, content creation, and education and training, and will also drive advances in speech recognition and synthetic-speech detection to counter potential abuse.<\/p>\n<p data-track=\"68\">Project address: https:\/\/www.microsoft.com\/en-us\/research\/project\/vall-ex\/vall-e-2\/<\/p>","protected":false},"excerpt":{"rendered":"<p>Recently, Microsoft released VALL-E 2, a zero-shot text-to-speech (TTS) model that has attracted widespread attention in the technology community. This breakthrough achieves human-parity speech synthesis for the first time and is considered a landmark advance in the TTS field. Technical highlights and innovations: Zero-shot learning: VALL-E 2 can mimic a voice speaking any text content given only a short unfamiliar speech sample, demonstrating remarkable instant mimicry capability. Repetition-aware sampling: an improved random sampling method that effectively alleviates the infinite-loop problem and improves decoding stability. Grouped code modeling: reduces sequence length by grouping codec codes, accelerating inference while improving performance. 
Simplified training data requirements: VALL-E 2 requires only simple speech<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[591,280,2110,3700],"collection":[],"class_list":["post-16400","post","type-post","status-publish","format-standard","hentry","category-news","tag-tts","tag-280","tag-2110","tag-3700"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/16400","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=16400"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/16400\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=16400"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=16400"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=16400"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=16400"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}