VideoMT is a new method that repurposes standard vision transformers into high-speed video segmentation models. Instead of training a separate video-specific network, the approach propagates object queries across frames, letting the transformer track and segment moving objects over time.
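The article does not include reference code, so the following is only a minimal sketch of the query-propagation idea, not VideoMT's actual implementation. It assumes PyTorch, and the class and variable names (QueryPropagationSegmenter, frame_feats) are hypothetical: a set of learned object queries cross-attends to each frame's patch features, the updated queries are carried forward to the next frame, and per-frame mask logits come from query-to-patch similarity.

```python
import torch
import torch.nn as nn

class QueryPropagationSegmenter(nn.Module):
    """Illustrative sketch: object queries are updated against each frame's
    features and carried to the next frame, so each query can follow one
    object through the video."""

    def __init__(self, dim=256, num_queries=16, num_heads=8):
        super().__init__()
        # Learned object queries, shared across videos.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):
        # frame_feats: (T, N, D) -- T frames, N patch tokens, D channels,
        # e.g. the per-patch output of a vision transformer backbone.
        T, N, D = frame_feats.shape
        q = self.queries.unsqueeze(0)  # (1, Q, D)
        masks = []
        for t in range(T):
            feats = frame_feats[t].unsqueeze(0)           # (1, N, D)
            attn_out, _ = self.cross_attn(q, feats, feats)
            q = self.norm(q + attn_out)                   # updated queries carry over to frame t+1
            # Mask logits: similarity between each query and each patch token.
            masks.append(torch.einsum("bqd,bnd->bqn", q, feats))
        return torch.stack(masks, dim=1)  # (1, T, Q, N) per-frame mask logits


# Usage with random features standing in for a backbone's output:
model = QueryPropagationSegmenter()
feats = torch.randn(8, 196, 256)   # 8 frames, 14x14 patches, 256-dim features
mask_logits = model(feats)
print(mask_logits.shape)           # torch.Size([1, 8, 16, 196])
```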
The system can segment people, animals, and tools in complex scenes while running up to 160 frames per second, which is 5 to 10 times faster than several previous methods. That speed makes VideoMT a candidate for real-time applications like video editing tools, AR filters, and robotics perception pipelines.
The researchers also highlight that the method works with off-the-shelf transformer backbones, reducing the need for specialized architectures.
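To illustrate the off-the-shelf claim, here is one way to pull per-frame patch features from a pretrained vision transformer. This assumes the timm library; the model name and token layout are standard for that library, not something specified by the VideoMT work.

```python
import timm
import torch

# Load a stock ViT backbone (downloads pretrained weights on first use).
backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
backbone.eval()

frames = torch.randn(8, 3, 224, 224)  # 8 video frames
with torch.no_grad():
    # forward_features returns the unpooled token sequence: (8, 197, 768),
    # i.e. one class token plus 14x14 = 196 patch tokens per frame.
    tokens = backbone.forward_features(frames)

patch_feats = tokens[:, 1:, :]  # drop the class token -> (8, 196, 768)
print(patch_feats.shape)
```

Features like these could then feed a query-propagation head such as the sketch above, with a projection from 768 to the head's channel width.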