{"id":38908,"date":"2025-07-04T19:18:13","date_gmt":"2025-07-04T11:18:13","guid":{"rendered":"https:\/\/www.1ai.net\/?p=38908"},"modified":"2025-07-04T19:18:13","modified_gmt":"2025-07-04T11:18:13","slug":"%e9%98%bf%e9%87%8c%e9%80%9a%e4%b9%89%e5%bc%80%e6%ba%90%e6%97%97%e4%b8%8b%e9%a6%96%e4%b8%aa%e9%9f%b3%e9%a2%91%e7%94%9f%e6%88%90%e6%a8%a1%e5%9e%8b-thinksound%ef%bc%9a%e5%8f%af%e5%83%8f%e4%b8%93","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/38908.html","title":{"rendered":"Ali Tongyi open-sources its first audio generation model ThinkSound: think like a \"professional sound engineer\""},"content":{"rendered":"<p>July 4 news: <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%98%bf%e9%87%8c\" title=\"[View articles tagged with [Ali]]\" target=\"_blank\" >Ali<\/a>'s \"Tongyi Big Model\" official account announced today that Tongyi Lab's first <strong><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%9f%b3%e9%a2%91%e7%94%9f%e6%88%90%e6%a8%a1%e5%9e%8b\" title=\"[Sees articles with tags]\" target=\"_blank\" >audio generation model<\/a><\/strong>,\u00a0<a href=\"https:\/\/www.1ai.net\/en\/tag\/thinksound\" title=\"[See articles with [ThinkSund] label]\" target=\"_blank\" >ThinkSound<\/a>, is now officially <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >open source<\/a> and will break the imaginative limits of the \"silent screen\".<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-38909\" title=\"98acdbeaj00syvgp200had000u000j5p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/07\/98acdbeaj00syvgp200had000u000j5p.jpg\" alt=\"98acdbeaj00syvgp200had000u000j5p\" width=\"1080\" height=\"689\" \/><\/p>\n<p>ThinkSound applies CoT (Chain-of-Thought) reasoning to audio generation for the first time, so that AI learns to <strong>\"think through\" step by step<\/strong> the relationship between on-screen events and sound, which enables 
<strong>high-fidelity, tightly synchronized<\/strong> spatial audio generation: not just a \"voice-over\", but truly \"hearing the picture\".<\/p>\n<p>To teach AI to \"listen logically\", the Tongyi Lab speech team built AudioCoT, the first multimodal audio dataset that supports chain-of-thought reasoning.<\/p>\n<p>AudioCoT incorporates 2,531.8 hours of high-quality samples from multiple sources including VGGSound, AudioSet, AudioCaps, Freesound, and more. The data covers <strong>animal calls, running machinery, and ambient sound effects<\/strong>, among other real-world scenarios, giving the model a rich and diverse training base. To ensure that every sample truly supports structured reasoning, the research team designed a fine-grained data screening pipeline that includes multi-stage automated quality filtering and <strong>manual spot-check verification of no less than 5% of samples<\/strong>, with checks at every layer to protect the overall quality of the dataset.<\/p>\n<p>On top of this, AudioCoT also includes object-level and command-level samples designed for interactive editing, to support ThinkSound's refinement and editing capabilities in later stages.<\/p>\n<p>ThinkSound consists of two key components: a Multimodal Large Language Model (MLLM) that handles the \"thinking\" and a Unified Audio Generation Model (UAGM) that handles the \"audio output\". 
Together, these two modules let the system analyze on-screen content in three stages, from understanding the overall scene, to focusing on specific objects, to responding to the user's commands, and ultimately generate precisely aligned audio.<\/p>\n<p>According to the official announcement, despite significant advances in end-to-end video-to-audio (V2A) generation in recent years, it remains difficult to truly capture the dynamic details and spatial relationships in a scene. Visual-acoustic correlations such as <strong>when an owl hoots, when it takes off, and whether a shaking branch makes a rustling sound<\/strong> are often overlooked, so the generated audio is <strong>too generic or even misaligned with key visual events<\/strong> and struggles to meet the strict requirements for temporal and semantic coherence in professional creative scenarios.<\/p>\n<p>The core problem is that AI lacks a structured understanding of on-screen events and cannot analyze, reason, and then synthesize sound step by step the way a human sound engineer does.<\/p>\n<p>Open-source links, attached by 1AI:<\/p>\n<ul>\n<li>https:\/\/github.com\/FunAudioLLM\/ThinkSound<\/li>\n<li>https:\/\/huggingface.co\/spaces\/FunAudioLLM\/ThinkSound<\/li>\n<li>https:\/\/www.modelscope.cn\/studios\/iic\/ThinkSound<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>July 4 news: Ali's \u201cTongyi Big Model\u201d official account announced today that ThinkSound, Tongyi Lab's first audio generation model, is now officially open source and will break the imaginative limits of the \u201csilent screen\u201d. 
ThinkSound applies CoT (Chain-of-Thought) reasoning to audio generation for the first time, so that AI learns to \"think through\" step by step the relationship between on-screen events and sound, enabling high-fidelity, tightly synchronized spatial audio generation: not just a \"voice-over\", but truly \"hearing the picture\". To teach AI to \u201clisten logically\u201d, the Tongyi Lab speech team built AudioCoT, the first multimodal audio dataset that supports chain-of-thought reasoning. Aud<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[7135,219,1759,3681],"collection":[],"class_list":["post-38908","post","type-post","status-publish","format-standard","hentry","category-news","tag-thinksound","tag-219","tag-1759","tag-3681"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/38908","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=38908"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/38908\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=38908"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=38908"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=38908"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=38908"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{
rel}","templated":true}]}}