10,000 frames on a single GPU? Zhiyuan Research Institute (BAAI) open-sources Video-XL-2, a lightweight ultra-long video understanding model

Video-XL-2, an open-source lightweight ultra-long video understanding model from the Zhiyuan Research Institute (BAAI), can efficiently process video inputs of up to 10,000 frames on a single GPU. The model consists of three components: a visual encoder, a dynamic token synthesis module, and a large language model. It is trained with a four-stage progressive training scheme and introduces a segmented prefilling strategy together with a dual-granularity KV decoding mechanism. Video-XL-2 outperforms all lightweight open-source models on mainstream evaluation benchmarks and encodes a 2,048-frame video in only 12 seconds, making it suitable for scenarios such as film and TV content analysis and abnormal-behavior monitoring.
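The segmented prefilling and dual-granularity KV ideas can be illustrated with a minimal sketch: frames are prefetched chunk by chunk so peak memory stays bounded, and older chunks keep only compressed (coarse) KV entries while the newest chunk retains full resolution. All names, the chunk size, and the compression ratio below are illustrative assumptions, not the actual Video-XL-2 implementation.

```python
def dual_granularity_cache(num_frames: int,
                           chunk_size: int = 2048,
                           compress_ratio: int = 4) -> list[int]:
    """Simulate segmented prefilling with a dual-granularity KV cache.

    Frames are processed in chunks of `chunk_size`; earlier chunks are
    downsampled by `compress_ratio` (coarse KVs), while the most recent
    chunk keeps one KV entry per frame (fine KVs). Returns the per-chunk
    KV-cache sizes. Parameters are hypothetical, for illustration only.
    """
    chunk_lens = [min(chunk_size, num_frames - start)
                  for start in range(0, num_frames, chunk_size)]
    # coarse granularity for all but the last chunk, fine for the last
    return [n // compress_ratio for n in chunk_lens[:-1]] + [chunk_lens[-1]]

kv_sizes = dual_granularity_cache(10_000)
print(kv_sizes)       # per-chunk cache sizes: [512, 512, 512, 512, 1808]
print(sum(kv_sizes))  # 3856 cached entries instead of 10000
```

Under these toy settings the cache holds 3,856 entries instead of 10,000, which is the intuition behind trading per-chunk granularity for the ability to fit ultra-long videos on a single GPU.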
