April 3 news: Meituan yesterday released the audio generation model LongCat-AudioDiT and simultaneously open-sourced its 1B and 3.5B versions.

According to the announcement, LongCat-AudioDiT models audio directly in the waveform latent space. It needs only a waveform variational autoencoder (Wav-VAE) and a diffusion Transformer (DiT), which removes at the root the error accumulation of multistage cascaded pipelines.
Training-inference alignment: at each inference step, the latent variables of the prompt region are forcibly reset to their ground-truth values, resolving the long-standing problem of timbre drift.
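The reset described above can be pictured with a toy sampler. This is a minimal sketch, not the released implementation: the function name `sample_with_prompt_reset`, the `velocity_fn` callback, the plain Euler update, and the choice to place the prompt at the start of the sequence are all assumptions made for illustration.

```python
def sample_with_prompt_reset(velocity_fn, init_latent, prompt_latent, num_steps=8):
    """Toy Euler sampler that re-clamps the prompt region after every step.

    velocity_fn(latent, t) is a hypothetical stand-in for the DiT; it returns
    one update direction per latent position. All names here are illustrative.
    """
    latent = list(init_latent)
    n_prompt = len(prompt_latent)
    # Start with the prompt region already holding the ground-truth latents.
    latent[:n_prompt] = prompt_latent
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = velocity_fn(latent, t)
        # The plain Euler update moves every position, including the prompt.
        latent = [x + dt * v for x, v in zip(latent, velocity)]
        # Reset the prompt region so its error cannot accumulate across steps.
        latent[:n_prompt] = prompt_latent
    return latent


# Usage with a dummy model that pushes every position by a constant velocity.
result = sample_with_prompt_reset(
    velocity_fn=lambda z, t: [1.0] * len(z),
    init_latent=[0.0, 0.0, 0.0, 0.0],
    prompt_latent=[5.0, 6.0],
)
```

In this toy run the first two positions stay at the prompt values while the remaining positions are integrated normally. How the actual model represents the prompt latents at each noise level is not stated in the announcement.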
Adaptive Projection Guidance (APG) replaces traditional classifier-free guidance (CFG). It decomposes the guidance signal into orthogonal and parallel components, keeping the useful part and suppressing the harmful one, which avoids spectral "over-saturation" while improving timbre similarity.
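The decomposition can be illustrated on plain vectors. The sketch below is a simplified reading rather than the model's code: it assumes the guidance signal is the difference between the conditional and unconditional predictions, that it is projected onto the conditional prediction, and that the parallel part is the one being down-weighted. The names `apg_guidance`, `cfg_guidance` and the weight `eta` are hypothetical.

```python
def cfg_guidance(cond, uncond, scale):
    """Standard classifier-free guidance on flat vectors."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg_guidance(cond, uncond, scale, eta=0.0):
    """Projection-based guidance sketch (assumed form, for illustration only).

    The guidance signal (cond - uncond) is split into a component parallel to
    the conditional prediction and a component orthogonal to it. The parallel
    part is scaled by eta; the orthogonal part is kept in full.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = sum(c * c for c in cond)
    # Projection coefficient of diff onto cond (zero if cond is the zero vector).
    coeff = sum(d * c for d, c in zip(diff, cond)) / norm_sq if norm_sq > 0 else 0.0
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


cond, uncond = [1.0, 0.0], [0.0, 1.0]
guided_apg = apg_guidance(cond, uncond, scale=4.0)           # parallel part dropped
guided_cfg = cfg_guidance(cond, uncond, scale=4.0)           # ordinary CFG
same_as_cfg = apg_guidance(cond, uncond, scale=4.0, eta=1.0) # eta=1 recovers CFG
```

With `eta=1.0` the sketch reduces exactly to CFG, so `eta` acts as a dial for how much of the parallel component survives. Whether LongCat-AudioDiT applies further steps on top of the projection is not stated in the announcement.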
On the Seed benchmark, LongCat-AudioDiT-3.5B reached a speaker similarity (SIM) of 0.818 on the Chinese test set (Seed-ZH) and 0.797 on the Chinese hard set (Seed-Hard), exceeding models such as Seed-TTS, CosyVoice 3.5 and MiniMax-Speech and achieving current SOTA performance.
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-AudioDiT