ByteDance's Vidi2: video understanding exceeding Gemini 3 Pro

ByteDance's new AI model Vidi2 is a multimodal large language model with 12 billion parameters, dedicated to video understanding. It can process several hours of raw footage, understand the storyline, and then produce a complete TikTok video or film clip from a simple prompt. The key to this breakthrough is video understanding. In its second version, Vidi2 adds fine-grained spatio-temporal grounding (STG), which lets it identify both the timestamps and the bounding boxes of a target object in the video. Given a text query, it not only finds the matching time ranges but also marks the location of the specific object within those time ranges. (Foxnet search)
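To make the idea of an STG result concrete, here is a minimal Python sketch of how such output could be represented: each detection pairs a timestamp with a bounding box, and nearby detections are grouped into time ranges. This is not Vidi2's actual interface or output format. The names `TimedBox`, `Segment`, `group_into_segments`, and the `max_gap` threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TimedBox:
    """Hypothetical record: where the queried object is at time t (seconds)."""
    t: float
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class Segment:
    """Hypothetical record: one continuous time range plus its boxes."""
    start: float
    end: float
    boxes: List[TimedBox] = field(default_factory=list)


def group_into_segments(boxes: List[TimedBox], max_gap: float = 1.0) -> List[Segment]:
    """Group timestamped boxes into time ranges.

    A box joins the current segment when it is at most max_gap seconds
    after that segment's last box; otherwise it starts a new segment.
    max_gap is an assumed tuning parameter, not something from the source.
    """
    segments: List[Segment] = []
    for box in sorted(boxes, key=lambda b: b.t):
        if segments and box.t - segments[-1].end <= max_gap:
            segments[-1].end = box.t
            segments[-1].boxes.append(box)
        else:
            segments.append(Segment(start=box.t, end=box.t, boxes=[box]))
    return segments


# Made-up detections for a single text query, in normalized coordinates.
demo = [
    TimedBox(1.0, 0.10, 0.20, 0.30, 0.60),
    TimedBox(1.5, 0.12, 0.20, 0.32, 0.60),
    TimedBox(10.0, 0.50, 0.40, 0.70, 0.90),
]
demo_segments = group_into_segments(demo)
```

With this layout, the time range answers "when" and the boxes inside each segment answer "where", which mirrors the two parts of the grounding task described above.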
