February 10th: VideoWorld, an experimental video generation model jointly developed by ByteDance's Doubao Big Model team, Beijing Jiaotong University, and the University of Science and Technology of China, was open-sourced today. Unlike mainstream multimodal models such as Sora, DALL-E, and Midjourney, VideoWorld is the first model in the industry to learn knowledge of the world without relying on a language model.

According to the team, most existing models rely on language or labeled data to learn knowledge, and rarely learn from purely visual signals. However, language cannot capture all knowledge in the real world. For example, complex tasks such as origami or tying a bow tie are difficult to describe clearly in language. VideoWorld, by contrast, removes the language model and performs understanding and reasoning tasks in a unified way.
At the same time, it is built on a latent dynamics model that efficiently compresses the information about changes between video frames, significantly improving the efficiency and effectiveness of knowledge learning. Without relying on any reinforcement learning search mechanism or reward function, VideoWorld has reached a professional 5-dan level in 9x9 Go and can perform robotic tasks in a variety of environments.
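To make the latent dynamics idea concrete, here is a minimal toy sketch of compressing frame-to-frame change into a compact latent code. This is not the paper's actual architecture; the `ToyLatentDynamicsModel` class, layer sizes, and training loss below are all illustrative assumptions.

```python
# Toy sketch of the "latent dynamics" idea: compress the *change* between
# consecutive video frames into a compact latent code instead of modeling
# every raw pixel. Everything here is illustrative, not the paper's design.
import torch
import torch.nn as nn

class ToyLatentDynamicsModel(nn.Module):
    def __init__(self, in_channels: int = 3, latent_dim: int = 16):
        super().__init__()
        # Encoder sees a pair of frames stacked on the channel axis and
        # squeezes their difference information into a small latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels * 2, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Decoder maps the latent back to a per-pixel residual for 32x32 frames.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, in_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, frame_t, frame_t1):
        # z summarizes what changed between the two frames.
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        # Predict a residual change and add it to the current frame, so z
        # only has to encode the dynamics, not the whole image.
        return frame_t + self.decoder(z), z

# Usage: reconstruct the next 32x32 frame through a 16-dim bottleneck.
model = ToyLatentDynamicsModel()
f_t, f_t1 = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
pred, z = model(f_t, f_t1)
loss = nn.functional.mse_loss(pred, f_t1)  # push z to capture the change
```

Because the bottleneck is far smaller than a frame, a model trained this way is forced to represent inter-frame dynamics compactly, which is the intuition behind the efficiency gains described above.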
1AI attaches the relevant links below:
- Paper: https://arxiv.org/abs/2501.09781
- Code: https://github.com/bytedance/VideoWorld
- Project home page: https://maverickren.github.io/VideoWorld.github.io