Researchers have introduced VBVR (Very Big Video Reasoning), a new framework that lets text-to-video models reason about what they see instead of just generating pretty footage.
Built on top of the open-source Open-Sora "one" model, VBVR follows visual instructions, which lets it solve puzzles and track objects directly inside a video scene.
In demos, VBVR can identify and circle a specific character, solve simple logic and shape puzzles, and track an agent as it moves around collecting dots in a grid. The team also built a dedicated video reasoning benchmark, on which VBVR hit around 68.5 percent accuracy.
For reference, many existing models stayed below 50 percent. To push the ecosystem forward, the team released both the VBVR framework and a one-million-example video reasoning dataset, which weighs in at roughly 310 GB.