June 28, 2025 - Alibaba Cloud's Tongyi Qianwen (Qwen) team has just published an article announcing the launch of Qwen VLo, a unified multimodal understanding and generation model that users can try via Qwen Chat (chat.qwen.ai).

- The newly upgraded model not only "understands" the world but can also re-create it at high quality based on that understanding, making the leap from perception to generation.
Qwen VLo is described as building the entire image progressively, from left to right and top to bottom, with the picture gradually becoming clearer.
During the generation process, the model will continuously adjust and optimize the predicted content, thus ensuring a more harmonious and consistent final result. This generation mechanism not only improves the visual effect, but also brings users a more flexible and controllable creation experience.
According to the announcement, Qwen VLo is trained with dynamic resolution and supports dynamic-resolution generation. On both the input and output sides, the model supports images of arbitrary resolution and aspect ratio.
This means that users are no longer limited to a fixed format, and can generate image content to suit different scenarios, whether it's a poster, illustration, web banner or social media cover.
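As a rough illustration of what arbitrary resolution and aspect ratio can mean for a patch-based image model, the sketch below computes the patch grid for a few output sizes. The 16-pixel patch size, the example dimensions, and the function name `patch_grid` are assumptions made for illustration only; the announcement does not describe how Qwen VLo represents images internally.

```python
import math


def patch_grid(width: int, height: int, patch: int = 16) -> tuple[int, int, int]:
    """Return (columns, rows, total patches) for an image of the given size.

    Hypothetical helper: assumes the image is padded up to a multiple of
    `patch` pixels in each dimension. Not Qwen VLo's actual tokenization.
    """
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height and patch must be positive")
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows, cols * rows


if __name__ == "__main__":
    # Example sizes are illustrative, not taken from the announcement.
    formats = {
        "poster": (1080, 1920),
        "web banner": (1200, 300),
        "social media cover": (1500, 500),
    }
    for name, (w, h) in formats.items():
        cols, rows, total = patch_grid(w, h)
        print(f"{name}: {w}x{h} -> {cols}x{rows} grid, {total} patches")
```

The point of the sketch is only that the grid shape follows the requested size, so a tall poster and a wide banner yield differently shaped grids rather than being forced into one fixed format.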
In addition, Qwen VLo introduces a new generation mechanism: a progressive process that builds the image step by step, from top to bottom and left to right. According to the announcement, this mechanism not only improves generation efficiency but is also particularly well suited to tasks that require fine control, such as generating images that contain long passages of text. For example, when producing an advertisement design or comic panels with a large amount of text, Qwen VLo generates and revises the image step by step. This incremental approach lets users watch the process in real time and make adjustments as needed to get the best result.
Alibaba Cloud cautioned that Qwen VLo is still in the preview stage and has many shortcomings: generated output may be factually inaccurate or not fully consistent with the original image. The development team is continuing to iterate.
- Qwen VLo has been comprehensively upgraded on top of its original multimodal understanding and generation capabilities. It significantly deepens the model's understanding of image content and, on that basis, achieves more accurate and consistent generation results.
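The top-to-bottom, left-to-right order described above corresponds to what is commonly called raster order. The following minimal sketch shows that ordering on a small grid of blocks. The block granularity and the names `raster_order` and `render_progress` are hypothetical, and the sketch does not model the adjust-and-refine step mentioned in the announcement.

```python
from typing import Iterator


def raster_order(rows: int, cols: int) -> Iterator[tuple[int, int]]:
    """Yield (row, col) positions top to bottom, left to right within a row."""
    for r in range(rows):
        for c in range(cols):
            yield r, c


def render_progress(rows: int, cols: int, steps: int) -> str:
    """Draw the grid after `steps` blocks: '#' is generated, '.' is pending."""
    filled = set(list(raster_order(rows, cols))[:steps])
    lines = []
    for r in range(rows):
        lines.append("".join("#" if (r, c) in filled else "." for c in range(cols)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Show a 3x5 grid filling in, a few blocks at a time.
    for steps in (0, 4, 9, 15):
        print(f"after {steps} blocks:")
        print(render_progress(3, 5, steps))
        print()
```

Because earlier rows are complete before later ones begin, a viewer can inspect the upper part of the image while the rest is still pending, which is consistent with the real-time observation described in the article.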
- Here are the core highlights of Qwen VLo:
- 01 More accurate content understanding and re-creation
- Previous multimodal models were prone to semantic inconsistencies during generation, such as mistakenly rendering a car as a different type of object or failing to retain key structural features of the original image. Qwen VLo maintains a high degree of semantic consistency during generation through its improved ability to capture detail. For example, when a user inputs a photo of a car and asks to "change the color", Qwen VLo not only accurately identifies the car model but also preserves its original structural features while naturally changing the color, so that the result meets expectations without losing realism.
- 02 Support for open-ended instruction-based editing and generation
- Qwen VLo accepts open-ended instructions in natural language, such as "change the style of this picture to Van Gogh", "make this picture look like a 19th-century vintage photo", or "add a clear sky to this picture", and produces results that meet the user's expectations. Whether the task is artistic style transfer, scene reconstruction, or detail retouching, the model can handle it. Even traditional visual perception tasks, such as predicting depth maps, segmentation maps, detection maps, and edge information, can be carried out through editing instructions. More complex instructions, such as one that modifies objects, edits text, and replaces the background all at once, can also be handled.
- 03 Multi-language command support
- Qwen VLo supports multi-language commands, including Chinese and English, breaking down language barriers and providing a uniform and convenient interaction experience for users around the world. No matter which language you use, simply describe your needs and the model will quickly understand and output the desired results.