{"id":12342,"date":"2024-06-05T10:15:40","date_gmt":"2024-06-05T02:15:40","guid":{"rendered":"https:\/\/www.1ai.net\/?p=12342"},"modified":"2024-06-05T10:17:04","modified_gmt":"2024-06-05T02:17:04","slug":"chattts%e6%b7%b1%e5%ba%a6%e4%bd%93%e9%aa%8c%ef%bc%8c%e5%bc%80%e6%ba%90%e6%9c%80%e5%bc%ba%e6%96%87%e6%9c%ac%e8%bd%ac%e8%af%ad%e9%9f%b3tts%e5%b7%a5%e5%85%b7","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/12342.html","title":{"rendered":"ChatTTS in-depth experience, the most powerful open source text-to-speech (TTS) tool"},"content":{"rendered":"<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12343\" title=\"get-149\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-149.jpg\" alt=\"get-149\" width=\"1080\" height=\"449\" \/><\/div>\n<p>recent,<strong><a href=\"https:\/\/www.1ai.net\/en\/tag\/chattts\" title=\"[See articles with [ChatTS] label]\" target=\"_blank\" >ChatTTS<\/a> <\/strong>This one<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e8%af%ad%e9%9f%b3%e7%94%9f%e6%88%90%e9%a1%b9%e7%9b%ae\" title=\"[Sees articles with [Voice Generation Project] labels]\" target=\"_blank\" >Speech Generation Project<\/a>exist <a href=\"https:\/\/www.1ai.net\/en\/tag\/github\" title=\"_Other Organiser\" target=\"_blank\" >GitHub<\/a> It quickly gained attention.<strong>June 4<\/strong>, 6 days have been achieved<strong>18.9 thousand stars<\/strong>\ud83c\udf1f. All the netizens said it was amazing! 
At this rate, it will soon break through <strong>20,000 stars<\/strong>.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12344\" title=\"get-150\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-150.jpg\" alt=\"get-150\" width=\"1080\" height=\"1175\" \/><\/div>\n<p data-track=\"152\">Website: <u>https:\/\/github.com\/2noise\/ChatTTS<\/u><\/p>\n<p data-track=\"153\">ChatTTS is a <strong><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%96%87%e6%9c%ac%e7%94%9f%e6%88%90%e8%af%ad%e9%9f%b3\" title=\"[See articles with [text-generated voice] label]\" target=\"_blank\" >text-to-speech<\/a><\/strong> model (TTS, or Text-To-Speech) designed specifically for conversation scenarios. It supports multiple languages, including English and Chinese. The largest model was trained on 100,000 hours of Chinese and English data; the <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open-source<\/a> version is the one trained on 40,000 hours, without SFT. 
This ensures the high quality and naturalness of the sound synthesis.<\/p>\n<p data-track=\"154\">According to the official introduction, ChatTTS has <strong>3 highlights<\/strong>:<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12345\" title=\"get-151\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-151.jpg\" alt=\"get-151\" width=\"1080\" height=\"448\" \/><\/div>\n<p data-track=\"155\">After listening to the official self-introduction audio on GitHub, the voice is strikingly <strong>realistic and natural, with fluent flow, pauses, and laughter<\/strong>.<\/p>\n<p data-track=\"156\">Let&#039;s try the official prompt and see how it performs:<\/p>\n<p data-track=\"158\">The code is as follows:<\/p>\n<pre><code># setup (not shown in the original snippet)\nimport torch\nimport torchaudio\nimport ChatTTS\n\nchat = ChatTTS.Chat()\nchat.load_models()\n\ninputs_cn = &quot;&quot;&quot;chat TTS is a powerful conversational text-to-speech model. It has mixed Chinese and English reading and multi-speaker capabilities. chat TTS can not only generate natural and fluent speech, but also control paralinguistic phenomena such as [laugh] laughter [laugh], pauses [uv_break] and modal particles [uv_break]. This prosody surpasses many open source models [uv_break]. Please note that the use of chat TTS should comply with legal and ethical standards to avoid security risks of abuse. [uv_break]&quot;&quot;&quot;.replace(&#039;\\n&#039;, &#039;&#039;)\n\nparams_refine_text = {&#039;prompt&#039;: &#039;[oral_2][laugh_0][break_4]&#039;}\naudio_array_cn = chat.infer(inputs_cn, params_refine_text=params_refine_text)\n# audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)\ntorchaudio.save(&quot;output3.wav&quot;, torch.from_numpy(audio_array_cn[0]), 24000)<\/code><\/pre>\n<p data-track=\"160\">Besides the official self-introduction above, the clip everyone is most familiar with is surely the one that has been everywhere these days: the <strong>Sichuan cuisine reading<\/strong>. I have to say, the generated result is really natural and smooth!<\/p>\n<p data-track=\"161\"><strong>Before we begin:<\/strong><\/p>\n<p data-track=\"162\">This article is divided into three parts; feel free to jump straight to the section that interests you \ud83d\udcd6.<\/p>\n<ul>\n<li data-track=\"163\">ChatTTS in-depth review (pitting it against the &quot;seven emotions and six desires&quot;; audio excerpts are referenced throughout, though the harshest ones are left out to spare your ears)<\/li>\n<li data-track=\"164\">How to use ChatTTS<\/li>\n<li data-track=\"165\">Other open source TTS project recommendations<\/li>\n<\/ul>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"166\">ChatTTS vs. the Human &quot;Seven Emotions and Six Desires&quot;<\/h1>\n<p data-track=\"167\">Everyone has emotions. The voice generated by ChatTTS is said to be very realistic and natural, so let&#039;s challenge it with our <strong>&quot;seven emotions and six desires&quot;<\/strong> and see how capable it is! 
We use ChatTTS&#039;s <strong>text control tags<\/strong> to enrich the emotional expression of the voice. Enjoy the details below:<\/p>\n<p data-track=\"168\"><strong>Desire for Gain:<\/strong><\/p>\n<p data-track=\"169\">Every time I see my investment numbers double, I feel excited like I have discovered a new world [break_1], which makes me unable to stop.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12346\" title=\"get-152\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-152.jpg\" alt=\"get-152\" width=\"1080\" height=\"336\" \/><\/div>\n<p data-track=\"171\">The whole sentence flows naturally, and the emotion of the word &quot;excited&quot; is relatively prominent.<\/p>\n<p data-track=\"172\"><strong>Desire for Food:<\/strong><\/p>\n<p data-track=\"173\">When I saw the sumptuous dinner being placed on the table, I couldn&#039;t help but drool[break_1][oral_3], and every dish made me salivate.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12347\" title=\"get-153\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-153.jpg\" alt=\"get-153\" width=\"1080\" height=\"301\" \/><\/div>\n<p data-track=\"175\">There are some duplicated words in the text output, and the ending sounds unclear in the speech, but the overall emotion is full and relatively smooth.<\/p>\n<p data-track=\"176\"><strong>Desire for Sleep:<\/strong><\/p>\n<p data-track=\"177\">After a busy day, I just want to fall into the soft bed[lbreak] and indulge in a sweet dream[oral_4].<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12348\" title=\"get-154\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-154.jpg\" alt=\"get-154\" width=\"1080\" height=\"344\" \/><\/div>\n<p data-track=\"179\">The overall flow is natural and the emotions are on point.<\/p>\n<p data-track=\"180\"><strong>Desire for Wealth:<\/strong><\/p>\n<p data-track=\"181\">Every time I think about the moment of winning the jackpot, my heart is filled with excitement[laugh_2] and endless fantasies[break_2].<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12350\" title=\"get-156\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-156.jpg\" alt=\"get-156\" width=\"1080\" height=\"706\" \/><\/div>\n<p data-track=\"182\">Here the output starts to become garbled.<\/p>\n<p data-track=\"183\"><strong>Desire for Fame:<\/strong><\/p>\n<p data-track=\"184\">The moment I stood under the flash lights, I felt like I was the center of the world [laugh_1], with all eyes on me.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12349\" title=\"get-155\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-155.jpg\" alt=\"get-155\" width=\"1080\" height=\"272\" \/><\/div>\n<p data-track=\"186\">The whole sentence is very smooth, and after listening to it, I really feel like I am standing under the &quot;flash lights&quot;. 
Most interestingly, there is a sly little &quot;snicker&quot; at the end, which feels rather mischievous.<\/p>\n<p data-track=\"187\"><strong>Desire for Sex:<\/strong><\/p>\n<p data-track=\"188\">Under that charming light, I was deeply attracted by the other person&#039;s deep eyes[break_3] and couldn&#039;t extricate myself.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12351\" title=\"get-157\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-157.jpg\" alt=\"get-157\" width=\"1080\" height=\"336\" \/><\/div>\n<p data-track=\"189\">It reads with no emotion at all, yet it is extremely natural and fluent.<\/p>\n<p data-track=\"190\"><strong>The Seven Emotions<\/strong><\/p>\n<p data-track=\"191\">To better express the complex emotional system of the &quot;seven emotions and six desires&quot;, we can finely control the emotional expression of the speech by embedding control tags in the text, as follows:<\/p>\n<p data-track=\"192\"><strong>Joy<\/strong>:<\/p>\n<p data-track=\"193\">-Original text: I finally got the long-awaited promotion [laugh_1], I feel like I&#039;m on top of the world [break_2], and all my hard work has paid off.<\/p>\n<p data-track=\"194\">-Text output: The output starts with <strong>garbled repetitions<\/strong>; it often needs to be regenerated several times.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12352\" title=\"get-158\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-158.jpg\" alt=\"get-158\" width=\"1080\" height=\"654\" \/><\/div>\n<p data-track=\"195\">-Audio output: The audio contains basically no complete words or sentences; it is essentially random shouting.<\/p>\n<p data-track=\"196\">We can see that in both of these attempts, the <strong>text output<\/strong> stage is already problematic.<\/p>\n<p data-track=\"197\"><strong>
Anger<\/strong><\/p>\n<p data-track=\"198\">-Original text: I was so furious when I saw that unfair report [lbreak], how could the facts be distorted like this [oral_5]?<\/p>\n<p data-track=\"199\">-Text output: All text is output correctly.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12353\" title=\"get-159\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-159.jpg\" alt=\"get-159\" width=\"1080\" height=\"832\" \/><\/div>\n<p data-track=\"200\">-Audio output: The first half of the sentence is very calm, with a pause in the middle, but the second half does not sound angry at all, and even carries a slight &quot;smile&quot;, which is not very fitting.<\/p>\n<p data-track=\"202\">The sentence is read completely, and even has pauses, but the emotional shifts are immature and almost impossible to hear. (\ud83d\udc30 Wild guess: only the laughter and pause tags have any obvious effect.)<\/p>\n<p data-track=\"203\"><strong>Sorrow:<\/strong><\/p>\n<p data-track=\"204\">-Original text: At the farewell ceremony, I tried to suppress my sadness [break_4][oral_2], but tears still flowed out.<\/p>\n<p data-track=\"205\">-Text output: Tried twice; both times <strong>the text for &quot;but tears still flowed out&quot; is missing<\/strong>.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12354\" title=\"get-160\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-160.jpg\" alt=\"get-160\" width=\"1080\" height=\"750\" \/><\/div>\n<p data-track=\"206\">-Audio output: The emotion is very calm, but there are <strong>interjections of &quot;\u55ef&quot;<\/strong>.<\/p>\n<p data-track=\"207\">In this third sentence, <strong>the text output already drops part of the original information<\/strong>; in the audio, the second half, &quot;<strong>but the tears still flowed<\/strong>&quot;, is lost entirely and no speech is generated for it. The emotion is very calm, but you can hear a hint of dialect.<\/p>\n<p data-track=\"208\"><strong>Happiness:<\/strong><\/p>\n<p data-track=\"209\">-Original text: At a friend&#039;s wedding, we laughed together [laugh_2], and the happiness we felt at that moment [break_1] was incomparable.<\/p>\n<p data-track=\"210\">-Text output: Halfway through, the text <strong>starts to garble<\/strong>; the rest is basically a stuttering state.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12355\" title=\"get-161\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-161.jpg\" alt=\"get-161\" width=\"1080\" height=\"830\" \/><\/div>\n<p data-track=\"211\">- Audio output: It's fun: basically a stammering state of \"I, I, I...\" and so on.<\/p>\n<p data-track=\"212\">This attempt is very similar to the first: basically no complete words appear, yet the emotions are very full, with laughter hidden in the embarrassment \ud83d\ude02. 
As for how much fun it is, you really have to hear it with your own ears.<\/p>\n<p data-track=\"213\"><strong>Thoughtfulness:<\/strong><\/p>\n<p data-track=\"214\">-Original text: Watching the sunset gradually set[break_3], I fell into deep thought[oral_1], pondering the true meaning of life.<\/p>\n<p data-track=\"215\">-Text output: The first half of the text output has the word \u201c\u897f\u4e0b\u201d added, which causes repetition in the content.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12357\" title=\"get-163\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-163.jpg\" alt=\"get-163\" width=\"1080\" height=\"814\" \/><\/div>\n<p data-track=\"216\">-Audio output: The audio likewise doubles the &quot;\u897f\u4e0b&quot;, which sounds strange, but emotionally it has a particularly thoughtful feel.<\/p>\n<p data-track=\"218\">Besides occasionally dropping sentences, the text output also <strong>duplicates words<\/strong>, which directly affects the final generated speech. <strong>It&#039;s a bit like a gacha draw: it takes several attempts to get a satisfying result<\/strong>.<\/p>\n<p data-track=\"219\"><strong>Surprise:<\/strong><\/p>\n<p data-track=\"220\">-Original text: When I accidentally found the lost letter[lbreak], my surprise was beyond words[break_2].<\/p>\n<p data-track=\"221\">-Text output: The text content is output completely.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12356\" title=\"get-162\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-162.jpg\" alt=\"get-162\" width=\"1080\" height=\"756\" \/><\/div>\n<p data-track=\"222\">-Audio output: The audio is complete, with <strong>well-timed pauses<\/strong>; there is some feeling on the word &quot;accidentally&quot;, but the surprise in the second half of the sentence does not come through.<\/p>\n<p data-track=\"223\">This time, both text and audio were output completely. There was some emotion, but the fluctuations were subtle and no surprise could be heard; it was like a calm reading.<\/p>\n<p data-track=\"224\"><strong>Fear:<\/strong><\/p>\n<p data-track=\"225\">-Original text: The strange noises at night make me shudder[break_3][oral_8], and every sound makes me shudder.<\/p>\n<p data-track=\"226\">-Text output: The text output has <strong>one extra word<\/strong>; the rest is complete.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12358\" title=\"get-164\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-164.jpg\" alt=\"get-164\" width=\"1080\" height=\"765\" \/><\/div>\n<p data-track=\"227\">-Audio output: The reading is very natural, and the overall emotion is relatively full; on the word &quot;shudder&quot; it even takes a breath.<\/p>\n<p data-track=\"229\">This time ChatTTS shows its strengths in full: <strong>smooth, natural, and emotionally on point<\/strong>. There is even a second &quot;hmm&quot; of human voice at the end, which suggests many treasure features are still waiting to be discovered.<\/p>\n<p data-track=\"230\">You will find that ChatTTS&#039;s <strong>output is not stable: sometimes complete, sometimes missing parts<\/strong>. So, once again (important things are said three times\u203c\ufe0f):<\/p>\n<p data-track=\"231\"><strong>Draw more cards and try more!<\/strong><\/p>\n<p data-track=\"232\"><strong>Draw more cards and try more!<\/strong><\/p>\n<p data-track=\"233\"><strong>Draw more cards and try more!<\/strong><\/p>\n<p data-track=\"234\">In general, through control marks for pauses, laughter, and oral characteristics, ChatTTS can convey complex emotional states more accurately and improve the expressiveness and interactivity of voice content. 
But relatively speaking, there is still some distance to go.<\/p>\n<p data-track=\"235\"><strong>Summary<\/strong><\/p>\n<p data-track=\"236\">After this review, it is easy to see why ChatTTS is so popular on GitHub. The <strong>reasons<\/strong> include:<\/p>\n<p data-track=\"237\"><strong>\u2013 Multilingual support:<\/strong> Whether you speak Chinese or English, it can handle it.<\/p>\n<p data-track=\"238\"><strong>\u2013 Expressive sound:<\/strong> It can add laughter or change the tone of voice while speaking to make the conversation more natural.<\/p>\n<p data-track=\"239\"><strong>\u2013 Usability:<\/strong> Its setup process is simple and straightforward, and it can be smoothly integrated into various programs.<\/p>\n<p data-track=\"240\">But there are also some <strong>problems<\/strong>:<\/p>\n<p data-track=\"241\"><strong>\u2013 Sometimes it gets stuck<\/strong> or sounds intermittent, which affects the experience.<\/p>\n<p data-track=\"242\"><strong>\u2013 The quality of the sound is uneven.<\/strong> Sometimes it takes several tries to get a sound that sounds good.<\/p>\n<p data-track=\"243\">Some <strong>usage scenarios<\/strong> I can think of (there are certainly more):<\/p>\n<p data-track=\"244\"><strong>\u2013 Robots and virtual assistants:<\/strong> Realistic voice output is particularly suitable for enhancing the user&#039;s interactive experience.<\/p>\n<p data-track=\"245\"><strong>\u2013 Production of multimedia content:<\/strong> For example, audiobooks or storytelling.<\/p>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"246\">How to use ChatTTS<\/h1>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"247\">Text preprocessing: embed controls in the text<\/h1>\n<p data-track=\"248\">At the text level, ChatTTS uses special tags as <strong>embedded commands<\/strong>. These tags let you <strong>control pauses and laughter<\/strong> as well as other <strong>oral<\/strong> characteristics.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12359\" title=\"get-165\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-165.jpg\" alt=\"get-165\" width=\"1080\" height=\"231\" \/><\/div>\n<ul>\n<li data-track=\"249\">Sentence-level control: insert marks such as<strong> [laugh_(0-2)] <\/strong>to add laughter,<strong> [break_(0-7)] <\/strong>to indicate pauses of varying lengths, and<strong> [oral_(0-9)] <\/strong>to control other oral features.<\/li>\n<li data-track=\"250\">Word-level control: by placing<strong> [uv_break] <\/strong>and<strong> [lbreak] <\/strong>, you achieve <strong>fine-grained control of pauses within sentences<\/strong>.<\/li>\n<\/ul>\n<p data-track=\"251\">For example, if you are creating a whimsical AI character for a children&#039;s storytelling app, you can feed ChatTTS text like this (English picture book):<\/p>\n<pre><code>&quot;Once upon a time, [uv_break] in a land filled with talking carrots and singing potatoes, [break_2] lived a little firefly named Flicker. 
[laugh] Flicker loved to [uv_break] dance among the moonbeams!&quot;<\/code><\/pre>\n<p data-track=\"253\">Actual result: the English is delivered in one go and sounds very good.<\/p>\n<p data-track=\"255\">Chinese picture book:<\/p>\n<pre><code>&quot;Once upon a time,[uv_break] in a land full of talking carrots and singing potatoes,[break_2] there lived a little firefly named Flack.[laugh] Fireflies love to dance in the[uv_break] moonlight!&quot;<\/code><\/pre>\n<p data-track=\"257\">Actual result: the long passage reads very smoothly, and there are laughs as well.<\/p>\n<p data-track=\"259\">By refining these markers, you can have ChatTTS generate a voice that pauses for dramatic effect, laughs warmly, and brings that fantasy world to life.<\/p>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"260\">Inference parameters: fine-tuning the output<\/h1>\n<p data-track=\"261\">During audio generation (inference), you can further refine the output via arguments passed to the chat.infer() function:<\/p>\n<p data-track=\"262\">1\ufe0f\u20e3 params_infer_code: This dictionary controls aspects such as speaker identity (spk_emb), voice variation (temperature), and decoding strategy (top_P, top_K).<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12360\" title=\"get-166\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-166.jpg\" alt=\"get-166\" width=\"1080\" height=\"672\" \/><\/div>\n<p data-track=\"263\">2\ufe0f\u20e3 params_refine_text: This dictionary is mainly used for sentence-level control, similar to <strong>using tags inside the text<\/strong>.<\/p>\n<p data-track=\"264\"><strong>These two levels of control combined enable unprecedented expressiveness and customization of synthesized speech.<\/strong><\/p>\n<p data-track=\"265\"><strong>Note:<\/strong> There are many demo sites on the Internet; https:\/\/chattts.com\/ is not the official website, but it lets you try ChatTTS directly without any deployment.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12362\" title=\"get-168\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-168.jpg\" alt=\"get-168\" width=\"1080\" height=\"721\" \/><\/div>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12361\" title=\"get-167\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-167.jpg\" alt=\"get-167\" width=\"1080\" height=\"172\" \/><\/div>\n<p data-track=\"266\">The official entrance is here: <u>https:\/\/github.com\/2noise\/ChatTTS<\/u><\/p>\n<p data-track=\"267\">Readers with basic coding skills can try it themselves, or use the Colab notebook that a netizen has deployed:<\/p>\n<p data-track=\"268\"><u>https:\/\/colab.research.google.com\/github\/Kedreamix\/ChatTTS\/blob\/main\/ChatTTS_infer.ipynb<\/u><\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12363\" title=\"get-169\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-169.jpg\" alt=\"get-169\" width=\"1080\" height=\"603\" \/><\/div>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-12364\" title=\"get-170\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/06\/get-170.jpg\" alt=\"get-170\" width=\"1080\" height=\"505\" \/><\/div>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"269\">Other open source TTS models also worth paying attention to<\/h1>\n<ul>\n<li data-track=\"270\"><strong>Bark <\/strong>is a transformer-based TTS model from Suno AI. It can generate a variety of audio outputs, including speech, music, background noise, and simple sound effects, as well as non-verbal sounds such as laughter, sighs, and sobs. 
Its tone and laughter effects are its main strengths.<\/li>\n<li data-track=\"271\">Project address: <u>https:\/\/github.com\/suno-ai\/bark<\/u><\/li>\n<li data-track=\"272\"><strong>Piper TTS<\/strong> (Text-to-Speech) is a neural network-based text-to-speech system optimized for low-power computers and hardware such as the Raspberry Pi. At its core, it is a fast, flexible, and easy-to-deploy text-to-speech solution, particularly suitable for scenarios that need to run on resource-constrained devices.<\/li>\n<li data-track=\"273\">Project address: <u>https:\/\/github.com\/rhasspy\/piper<\/u><\/li>\n<li data-track=\"274\"><strong>GradTTS<\/strong> is a representative of flexible-architecture models, providing an efficient, high-quality text-to-speech solution by combining diffusion probabilistic models, generative score matching, and monotonic alignment search. Its flexible framework and broad application prospects make it an important milestone in the text-to-speech field.<\/li>\n<li data-track=\"275\">Project address: <u>https:\/\/github.com\/WelkinYang\/GradTTS<\/u><\/li>\n<li data-track=\"276\"><strong>Matcha-TTS<\/strong> provides an efficient, natural, and easy-to-use non-autoregressive neural TTS solution suitable for a variety of application scenarios.<\/li>\n<li data-track=\"277\">Project address: <u>https:\/\/github.com\/shivammehta25\/Matcha-TTS<\/u><\/li>\n<\/ul>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"278\">Finally<\/h1>\n<p data-track=\"279\">Anyone who has worked with traditional TTS knows how stiff the output tends to be: obvious word and sentence breaks, no emotion at all, a robotic feel. 
These are just some of the problems.<\/p>\n<p data-track=\"280\">But ChatTTS gave me a big surprise. In terms of <strong>generation quality<\/strong>, <strong>its output feels remarkably close to human speech, with laughter, crying, and pauses<\/strong>, and it even breathes. Of course, it still has many shortcomings, such as long generation times, missing sentences, and sometimes failing to produce a complete sentence at all, but none of this will stop it from moving forward.<\/p>\n<p data-track=\"281\">The ChatTTS project not only achieved new technological breakthroughs, but also opened up new application possibilities for speech generation technology. The detailed sample code and documentation give developers and technology enthusiasts ample room for exploration and experimentation.<\/p>\n<p data-track=\"282\">We expect future versions to further improve the sound quality and expand the choice of speaker voices, bringing more innovations to the field of real-time speech generation.<\/p>\n<h1 class=\"pgc-h-arrow-right\" spellcheck=\"false\" data-track=\"283\">References:<\/h1>\n<p data-track=\"284\"><u>https:\/\/github.com\/2noise\/ChatTTS<\/u><\/p>\n<p data-track=\"285\"><u>https:\/\/chattts.com\/<\/u><\/p>\n<p data-track=\"286\"><u>https:\/\/ai.gopubby.com\/chattts-an-incredible-open-source-tts-model-for-dialogues-7ed71d55944f<\/u><\/p>\n<p data-track=\"287\"><u>https:\/\/www.bilibili.com\/video\/BV1zn4y1o7iV\/?vd_source=c51b77ea0e8c6261e9039c2c3d6b6410<\/u><\/p>","protected":false},"excerpt":{"rendered":"<p>Recently, ChatTTS, a speech generation project, has been gaining attention on GitHub. As of June 4, it has already gained 18.9 thousand stars in 6 days \ud83c\udf1f. Netizens are calling it amazing! At this rate, it will soon break through 20,000 stars. Website: https:\/\/github.com\/2noise\/ChatTTS ChatTTS is a text-to-speech model (TTS i.e. 
Text-To-Speech) specially designed for conversation scenarios, which supports multiple languages, including English and Chinese, and the largest model uses 100,000 hours of Chinese and English data for training, and the version open-sourced in Huggingface is the one with 40,000 hours of training and no sft. To ensure<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[149,144],"tags":[2876,385,219,2928,2929],"collection":[],"class_list":["post-12342","post","type-post","status-publish","format-standard","hentry","category-jiaocheng","category-baike","tag-chattts","tag-github","tag-219","tag-2928","tag-2929"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/12342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=12342"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/12342\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=12342"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=12342"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=12342"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=12342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}