Qwen2-VL, released by the Qwen team, introduces two key architectural improvements. The first is dynamic resolution support: the model can process images of arbitrary resolution without first splitting them into blocks, which is closer to human visual perception. The second is Multimodal Rotary Position Embedding (M-RoPE), which lets the language model capture and fuse positional information from text, images, and video at the same time, so that it can act as a multimodal processor and reasoner. At the 7B scale, Qwen2-VL-7B retains support for image, multi-image, and video input and delivers "competitive" performance at a more cost-effective model size. The Qwen2-VL-2B model, with 2B parameters, is optimized for potential mobile deployment while still performing well at image, video, and multilingual understanding.
Model links:
https://www.modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct
https://www.modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct
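
To make the M-RoPE idea more concrete, the sketch below assigns each token a three-part position ID (temporal, height, width) instead of a single index. It is a simplified illustration, not the model's actual code: the function name mrope_positions and the segment tuple format are hypothetical, and it assumes that text tokens share one value across all three components, that visual tokens are indexed by frame, row, and column, and that each new segment starts one past the largest ID used so far.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def mrope_positions(segments: List[tuple]) -> List[Position]:
    """Assign illustrative M-RoPE-style position IDs to a mixed sequence.

    Each segment is a hypothetical tuple, either
      ("text", n_tokens)  or  ("vision", frames, rows, cols).
    An image is a vision segment with frames == 1.
    """
    positions: List[Position] = []
    start = 0  # first position ID available to the next segment
    for seg in segments:
        if seg[0] == "text":
            _, n_tokens = seg
            # Text behaves like 1D RoPE: all three components are equal.
            block = [(start + i,) * 3 for i in range(n_tokens)]
        elif seg[0] == "vision":
            _, frames, rows, cols = seg
            # Visual tokens get separate frame, row, and column indices.
            block = [
                (start + t, start + h, start + w)
                for t in range(frames)
                for h in range(rows)
                for w in range(cols)
            ]
        else:
            raise ValueError(f"unknown segment kind: {seg[0]!r}")
        positions.extend(block)
        if block:
            # Assumption: the next segment starts after the largest ID used.
            start = max(max(p) for p in block) + 1
    return positions


if __name__ == "__main__":
    # Two text tokens, one 2x2 image, then two more text tokens.
    for pos in mrope_positions([("text", 2), ("vision", 1, 2, 2), ("text", 2)]):
        print(pos)
```

In this sketch a single image keeps a constant temporal component, while a video segment (frames greater than 1) increments it once per frame, which is how one scheme can cover text, image, and video positions together.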
