August 12 news: Zhipu AI today launched GLM-4.5V, which it bills as the best-performing open-source visual reasoning model in the 100B class (106B total parameters, 12B active parameters), open-sourcing it simultaneously on Hugging Face and the ModelScope community. API pricing starts as low as $2/M tokens for input and $6/M tokens for output.
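For readers who want to try the API directly, below is a minimal sketch of a multimodal chat call using Zhipu's Python SDK. The model id "glm-4.5v", the placeholder API key, and the example image URL are assumptions for illustration; the message format follows the SDK's OpenAI-style interface, so check the official API docs for the exact names.

```python
# A minimal sketch of calling GLM-4.5V through Zhipu's Python SDK
# (pip install zhipuai). The model id and image URL below are
# illustrative assumptions, not confirmed values.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id; verify in the API docs
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```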

1AI learned from the official introduction that GLM-4.5V is built on GLM-4.5-Air, Zhipu's new-generation flagship text base model, and continues the technical route of GLM-4.1V-Thinking. It reaches SOTA performance among open-source models of its size across 41 public visual multimodal benchmarks, covering common tasks such as image, video, and document understanding and GUI agents.
Beyond benchmark lists, the team places more emphasis on the model's performance and usability in real-world scenarios. Through efficient hybrid training, GLM-4.5V covers a wide range of visual content, enabling full-scene visual reasoning, including:
- Image reasoning (scene understanding, complex multi-image analysis, location recognition)
- Video understanding (shot-level analysis of long videos, event recognition)
- GUI tasks (screen reading, icon recognition, desktop operation assistance)
- Complex chart and long-document parsing (research paper analysis, information extraction)
- Grounding (pinpointing visual elements; see the sketch after this list)
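As one concrete example of the grounding capability above, the sketch below asks the model to locate a UI element and parses a bounding box from the reply. The prompt wording and the [x1, y1, x2, y2] coordinate convention are illustrative assumptions; consult the model card for the actual output format.

```python
# A hedged sketch of a grounding request: ask GLM-4.5V to locate a UI
# element and pull a bounding box out of the reply text. The coordinate
# convention is an assumption for illustration.
import json
import re

from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

resp = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
            {"type": "text",
             "text": "Locate the 'Submit' button and reply with its "
                     "bounding box as [x1, y1, x2, y2]."},
        ],
    }],
)

# Extract the first [x1, y1, x2, y2] list from the model's reply.
match = re.search(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]",
                  resp.choices[0].message.content)
if match:
    box = json.loads(match.group(0))
    print("Bounding box:", box)
```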
At the same time, the model adds a new "Thinking Mode" switch, letting users flexibly choose between fast responses and deep reasoning to balance efficiency and effectiveness. To help developers experience GLM-4.5V's capabilities intuitively and build their own multimodal applications, Zhipu AI has also open-sourced a desktop assistant application.
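A hedged sketch of how the switch might be driven from the API: Zhipu's GLM-4.5 text models expose a `thinking` request field, and this example assumes GLM-4.5V accepts the same field, which should be verified against the current API reference.

```python
# A minimal sketch of toggling "Thinking Mode" per request. The
# `thinking` field is an assumption carried over from the GLM-4.5 text
# API; older SDK versions may not accept it.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

def ask(prompt: str, deep: bool):
    return client.chat.completions.create(
        model="glm-4.5v",  # assumed model id
        messages=[{"role": "user", "content": prompt}],
        # Assumed field: fast response vs. deep reasoning.
        thinking={"type": "enabled" if deep else "disabled"},
    )

fast = ask("What file format is a .webp image?", deep=False)
slow = ask("Explain step by step why this layout overflows.", deep=True)
```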
The desktop application can take screenshots and record the screen in real time to capture screen information, relying on GLM-4.5V to handle a variety of visual reasoning tasks: everyday work such as code assistance, video content analysis, game Q&A, and document interpretation, making it a partner that can see your screen and work and play alongside you. Zhipu also hopes that, through the open-source model and API services, more developers can apply their creativity and imagination on top of the multimodal base model and turn scenes from past sci-fi movies into reality.
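To make that loop concrete, here is a hedged sketch in the spirit of such an assistant: grab the screen, base64-encode it, and send it to GLM-4.5V. The library choice (Pillow's ImageGrab) and the base64-in-url message format are assumptions for illustration, not the open-sourced app's actual implementation.

```python
# A hedged sketch of a screen-aware assistant loop: capture the screen,
# base64-encode it as PNG, and ask GLM-4.5V about it. Illustrative only.
import base64
import io

from PIL import ImageGrab  # pip install Pillow
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

# Capture the current screen and encode it as a base64 PNG string.
shot = ImageGrab.grab()
buf = io.BytesIO()
shot.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")

resp = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            # Assumes the API accepts a base64 payload in the url field.
            {"type": "image_url", "image_url": {"url": b64}},
            {"type": "text",
             "text": "What is on my screen right now, and what should "
                     "I do next?"},
        ],
    }],
)
print(resp.choices[0].message.content)
```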