Alibaba Tongyi open-sources its first audio generation model ThinkSound: thinks like a "professional sound engineer"

July 4 news: Alibaba's Tongyi official account announced today that Tongyi Lab's first audio generation model, ThinkSound, is now officially open source, breaking through the imaginative limits of the "silent screen".


ThinkSound applies chain-of-thought (CoT) reasoning to audio generation for the first time, letting the AI "think through" the relationship between on-screen events and sound step by step, and enabling high-fidelity, tightly synchronized spatial audio generation; not merely "dubbing over" the video, but truly "hearing the picture".

To teach the AI to "listen with logic", the Tongyi Lab speech team built AudioCoT, the first multimodal audio dataset designed to support chain-of-thought reasoning.

AudioCoT incorporates 2,531.8 hours of high-quality samples drawn from multiple sources, including VGGSound, AudioSet, AudioCaps, and Freesound. The data covers a wide range of real-world scenarios, from animal calls and running machinery to ambient sound effects, providing a rich and diverse training base for the model. To ensure that every sample can genuinely support the model's structured reasoning, the research team designed a refined data-screening process consisting of multi-stage automated quality filtering plus manual calibration of no less than 5% of the samples, with these layers of gatekeeping safeguarding the overall quality of the dataset.
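The announcement does not spell out the screening pipeline beyond this description. Purely as an illustration of the pattern it names (multi-stage automated quality filtering followed by manual calibration of at least 5% of the samples), here is a minimal sketch; every field name and threshold is a placeholder assumption, not part of AudioCoT:

```python
import random

# Minimal sketch of "multi-stage automated filtering plus >=5% manual
# calibration". All field names and thresholds are hypothetical; the
# announcement does not publish AudioCoT's actual screening criteria.

def passes_duration(rec):
    # Stage 1: drop clips that are implausibly short or long.
    return 1.0 <= rec["duration_s"] <= 30.0

def passes_quality(rec):
    # Stage 2: drop clips with a low signal-to-noise ratio.
    return rec["snr_db"] >= 15.0

def passes_caption(rec):
    # Stage 3: drop clips without a usable text description.
    return bool(rec["caption"].strip())

def filter_dataset(records, manual_ratio=0.05, seed=0):
    kept = [r for r in records
            if passes_duration(r) and passes_quality(r) and passes_caption(r)]
    # Reserve at least 5% of the surviving samples for manual calibration.
    n_manual = max(1, int(len(kept) * manual_ratio)) if kept else 0
    manual_review = random.Random(seed).sample(kept, n_manual) if n_manual else []
    return kept, manual_review

if __name__ == "__main__":
    demo = [
        {"clip_id": "a", "duration_s": 12.0, "caption": "dog barking", "snr_db": 22.0},
        {"clip_id": "b", "duration_s": 0.4,  "caption": "click",       "snr_db": 30.0},
        {"clip_id": "c", "duration_s": 8.0,  "caption": "",            "snr_db": 25.0},
    ]
    kept, manual = filter_dataset(demo)
    print(f"{len(kept)} kept, {len(manual)} flagged for manual review")
```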

On top of this, AudioCoT also includes object-level and instruction-level samples designed for interactive editing, meeting ThinkSound's need for fine-grained refinement and editing capabilities in later stages.
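The sample schema is not published in the announcement. Purely as a hypothetical illustration, an object-level sample might pair a clip with a target object and its reasoning trace, while an instruction-level sample might pair audio before and after an edit:

```python
# Hypothetical illustration only: the actual AudioCoT schema is not given in
# the announcement. These dicts merely show what "object-level" and
# "instruction-level" interactive-editing samples could look like.
object_level_sample = {
    "video_clip": "clips/owl_takeoff.mp4",        # placeholder path
    "target_object": "owl",                        # which on-screen object to sound
    "chain_of_thought": "The owl hoots twice, then flaps its wings as it takes off.",
    "reference_audio": "audio/owl_takeoff.wav",
}

instruction_level_sample = {
    "video_clip": "clips/owl_takeoff.mp4",
    "instruction": "Reduce the wind noise and make the wing flaps more prominent.",
    "source_audio": "audio/owl_takeoff_v1.wav",    # audio before the edit
    "edited_audio": "audio/owl_takeoff_v2.wav",    # audio after the edit
}
```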

ThinkSound consists of two key components: a multimodal large language model (MLLM) responsible for "thinking" and a unified audio generation model (UAGM) responsible for producing the sound. Together, the two modules let the system analyze the video content in three stages, from understanding the overall picture, to focusing on specific sounding objects, to responding to the user's editing instructions, and ultimately generate precisely aligned sound effects.
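The ThinkSound API itself is not shown in the announcement; the sketch below only mirrors the three-stage flow described above with stand-in functions (an MLLM that "thinks", then a UAGM that renders the audio). All names and signatures are assumptions for illustration, not the real interface:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SoundPlan:
    """Intermediate chain-of-thought output from the 'thinking' MLLM (hypothetical)."""
    scene_summary: str
    focus_objects: List[str]
    edit_notes: str

def mllm_reason(video_path: str, instruction: Optional[str]) -> SoundPlan:
    # Stand-in for the multimodal LLM: stage 1 summarizes the whole scene,
    # stage 2 picks out the sounding objects, stage 3 folds in the user's edit.
    scene = f"overall description of {video_path}"
    objects = ["owl", "branch"]
    notes = instruction or "no user edit requested"
    return SoundPlan(scene, objects, notes)

def uagm_generate(plan: SoundPlan) -> bytes:
    # Stand-in for the unified audio generation model: a real model would
    # synthesize a waveform aligned to the plan; here we just return bytes.
    return f"audio for: {plan.scene_summary} / {plan.focus_objects} / {plan.edit_notes}".encode()

def video_to_audio(video_path: str, instruction: Optional[str] = None) -> bytes:
    plan = mllm_reason(video_path, instruction)   # think first...
    return uagm_generate(plan)                    # ...then generate the sound

if __name__ == "__main__":
    wav_bytes = video_to_audio("clips/owl_takeoff.mp4",
                               instruction="make the wing flaps more prominent")
    print(len(wav_bytes), "bytes of placeholder output")
```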

According to the announcement, despite significant advances in end-to-end video-to-audio (V2A) generation in recent years, such systems still struggle to truly capture the dynamic details and spatial relationships in a scene. Visual-acoustic correspondences, such as when an owl hoots and takes off, or whether a shaking branch produces a rustling sound, are often overlooked, so the generated audio is too generic or even misaligned with key visual events, falling short of the strict temporal and semantic coherence required in professional creative scenarios.

The core problem is that the AI lacks a structured understanding of the events in the scene, and cannot analyze, reason about, and resynthesize sound step by step the way a human sound engineer does.

Attached open-source addresses:

  • https://github.com/FunAudioLLM/ThinkSound
  • https://huggingface.co/spaces/FunAudioLLM/ThinkSound
  • https://www.modelscope.cn/studios/iic/ThinkSound