News, September 19: Yesterday afternoon, ModelBest announced a refresh of its "Small Steel Gun" (MiniCPM) series, introducing VoxCPM, a speech generation base model with 0.5B parameters. VoxCPM was officially released by the Human-Computer Speech Interaction Laboratory at Tsinghua University Shenzhen International Graduate School. At 0.5B parameters, the model reaches industry SOTA levels in speech naturalness, speaker similarity, and prosody.

Performance: RTF ≈ 0.17, with streaming output supported. VoxCPM performs strongly on the Seed-TTS-eval benchmark, achieving a very low word error rate and voice-cloning speaker similarity approaching the level of real recordings. On an NVIDIA RTX 4090 GPU, inference reaches RTF ≈ 0.17, fast enough for high-quality real-time interaction.
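For context, RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, so RTF ≈ 0.17 means one second of speech takes roughly 0.17 seconds to generate. A minimal measurement sketch in Python (the synthesize callable and the 16 kHz sample rate are illustrative assumptions, not details from the announcement):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 16000) -> float:
    """RTF = synthesis wall-clock time / duration of generated audio.
    RTF < 1 means synthesis runs faster than real time."""
    start = time.perf_counter()
    wav = synthesize(text)                  # assumed: returns a 1-D sample array
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / sample_rate
    return elapsed / audio_seconds
```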
Listening experience: VoxCPM models emotion, accent, and rhythm, automatically choosing a vocal style that fits the text and generating diverse audio scenarios such as weather forecasts, pre-battle speeches, and dialect broadcasters. It supports Chinese-English bilingual voice cloning from only a few samples, and can even read mathematical formulas and symbols aloud (see the usage sketch below).
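To illustrate few-shot voice cloning, here is a hypothetical usage sketch. The voxcpm package name follows the GitHub repository, but the exact class and argument names (VoxCPM.from_pretrained, generate, prompt_wav_path, prompt_text) are assumptions that should be checked against the project README:

```python
import soundfile as sf
from voxcpm import VoxCPM  # assumed package layout from the GitHub repo

# Load the 0.5B base model from Hugging Face.
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Clone a voice from one short reference clip plus its transcript,
# then synthesize new text (here, a spoken math formula) in that voice.
wav = model.generate(
    text="The integral of x squared from zero to one equals one third.",
    prompt_wav_path="reference_speaker.wav",  # hypothetical local sample
    prompt_text="Transcript of the reference clip.",
)
sf.write("cloned_output.wav", wav, 16000)     # assumed 16 kHz output
```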
Technical architecture: VoxCPM integrates language modeling and diffusion in a single framework. Its core modules are LocEnc, TSLM, RALM, and LocDiT, which together achieve efficient generation and reconstruction of continuous speech features through a VAE decoder (a dataflow sketch follows).
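A purely illustrative dataflow sketch of such a hierarchy, reading the module names as a local encoder (LocEnc), a text-semantic language model (TSLM), a residual acoustic language model (RALM), and a local diffusion Transformer (LocDiT); the stub functions and shapes are assumptions for readability, not the actual VoxCPM implementation:

```python
import numpy as np

def loc_enc(prompt_latents: np.ndarray) -> np.ndarray:
    """LocEnc stub: encode local patches of reference speech latents."""
    return prompt_latents.mean(axis=0, keepdims=True)

def tslm(text_tokens: list[str]) -> np.ndarray:
    """TSLM stub: map text tokens to semantic feature vectors."""
    return np.random.randn(len(text_tokens), 64)

def ralm(semantic: np.ndarray, context: np.ndarray) -> np.ndarray:
    """RALM stub: refine semantics into acoustic features,
    conditioned on the encoded reference context."""
    return semantic + context

def loc_dit(acoustic: np.ndarray) -> np.ndarray:
    """LocDiT stub: a diffusion pass producing continuous speech
    latents (here an identity placeholder)."""
    return acoustic

def vae_decode(latents: np.ndarray, upsample: int = 320) -> np.ndarray:
    """VAE decoder stub: expand each latent frame into waveform samples."""
    return np.repeat(latents.mean(axis=1), upsample)

prompt = np.random.randn(50, 64)   # fake reference-speech latents
latents = loc_dit(ralm(tslm("hello world".split()), loc_enc(prompt)))
waveform = vae_decode(latents)
print(waveform.shape)              # (2 frames * 320 samples,) = (640,)
```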
VoxCPM is now open-sourced on GitHub, Hugging Face, and ModelScope, free for developers to download and try; an online PlayGround and an audio sample page have gone live alongside the release.
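For a quick local copy of the weights, the standard huggingface_hub API works; the destination directory below is an arbitrary choice:

```python
from huggingface_hub import snapshot_download

# Fetch the VoxCPM-0.5B repository from Hugging Face into a local folder.
local_dir = snapshot_download(
    repo_id="openbmb/VoxCPM-0.5B",
    local_dir="./VoxCPM-0.5B",  # hypothetical destination
)
print(f"Model files downloaded to {local_dir}")
```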
Model links:
GitHub: https://github.com/OpenBMB/VoxCPM
Hugging Face: https://huggingface.co/openbmb/VoxCPM-0.5B
ModelScope: https://modelscope.cn/models/OpenBMB/VoxCPM-0.5B
PlayGround experience: https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo
Audio sample page: https://openbmb.github.io/VoxCPM-demopage