Alibaba Tongyi Releases a Small Multimodal Model

Alibaba's Tongyi Qianwen team recently released Qwen2.5-Omni-3B, a new small multimodal model built in response to developer demand for a lightweight model that fits on consumer GPUs. Compared with Qwen2.5-Omni-7B, the 3B version uses more than 50% less GPU memory when processing long context sequences (~25k tokens) and can support audio-video interactions of up to 30 seconds on a typical 24GB consumer GPU. The 3B version retains more than 90% of the 7B model's multimodal comprehension capability, and the naturalness and stability of its speech output match those of the 7B version. Qwen2.5-Omni-3B is now open-sourced on the ModelScope community and Hugging Face.
