Nvidia is developing EgoScale, a system that trains robots by learning from large-scale egocentric human video. The dataset includes more than 20,000 hours of POV footage labeled with actions and hand poses, covering tasks such as folding clothes, using tools, and handling everyday objects.
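To make the labeling concrete, here is a minimal sketch of what one annotated sample from such an egocentric dataset might look like. Nvidia has not published a schema, so every field name below (frame, action_label, hand_pose, timestamp_s) is an illustrative assumption, not the actual format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EgoSample:
    """One annotated frame from an egocentric video clip.

    Hypothetical structure: field names and shapes are assumptions,
    not Nvidia's published schema.
    """
    frame: np.ndarray       # RGB image, e.g. shape (480, 640, 3)
    action_label: str       # e.g. "fold_shirt", "open_bottle"
    hand_pose: np.ndarray   # e.g. (2, 21, 3): two hands, 21 joints, xyz
    timestamp_s: float      # time offset within the source clip


sample = EgoSample(
    frame=np.zeros((480, 640, 3), dtype=np.uint8),
    action_label="fold_shirt",
    hand_pose=np.zeros((2, 21, 3), dtype=np.float32),
    timestamp_s=12.4,
)
print(sample.action_label, sample.hand_pose.shape)
```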
EgoScale uses a vision-language-action (VLA) model that maps visual input and text instructions to robot actions, teaching robots to perform complex, multi-step tasks. In demos, robots follow natural-language commands to manipulate cards, open bottles, or fold shirts, drawing on what the model learned from the POV data.
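The general VLA pattern is a closed loop: observe an image, combine it with the instruction, predict an action, execute, and repeat. The sketch below illustrates that pattern with placeholder classes; none of the names or interfaces come from Nvidia's system, and the policy simply returns zeros to show the shape of the loop.

```python
import numpy as np


class VLAPolicy:
    """Toy stand-in for a vision-language-action (VLA) model.

    A real VLA model encodes the camera image and the text instruction
    with pretrained vision/language backbones and decodes robot actions;
    this placeholder returns zeros just to illustrate the interface.
    """

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Hypothetical output: 6-DoF end-effector delta plus a gripper command.
        return np.zeros(7, dtype=np.float32)


class FakeCamera:
    def read(self) -> np.ndarray:
        return np.zeros((480, 640, 3), dtype=np.uint8)  # dummy RGB frame


class FakeRobot:
    def apply(self, action: np.ndarray) -> None:
        pass  # a real robot would execute the low-level command here


def control_loop(policy, camera, robot, instruction: str, steps: int = 10) -> None:
    """Closed-loop execution: observe, predict, act, repeat."""
    for _ in range(steps):
        image = camera.read()
        action = policy.predict_action(image, instruction)
        robot.apply(action)


control_loop(VLAPolicy(), FakeCamera(), FakeRobot(), "fold the shirt")
```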
Nvidia plans to release code on GitHub, which could accelerate research into general-purpose household and industrial robots.