Sony is working on MMHNet, an AI model that generates sound effects and ambient audio directly from video input. The system is designed to keep audio better aligned with visual events than earlier approaches such as MMAudio.
MMHNet uses a multimodal hierarchical architecture with Mamba-based components to maintain coherence over up to five minutes of audio. In examples, it produces synchronized Foley for actions such as balls hitting a table or quick action cuts.
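MMHNet's design has not been published in detail, so the sketch below is only a conceptual illustration of the general idea described above: a hierarchical video-to-audio model that uses a linear-time, Mamba-style state-space mixer to stay coherent over long sequences. Every class name, dimension, and parameter here (VideoToAudioSketch, SSMBlock, hidden sizes, the upsampling factor) is a hypothetical stand-in, not Sony's implementation.

```python
# Conceptual sketch only: all names and sizes below are assumptions for
# illustration; this is not MMHNet's actual architecture.
import torch
import torch.nn as nn


class SSMBlock(nn.Module):
    """Simplified diagonal state-space mixer (stand-in for a Mamba layer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable per-channel decay controls how long information persists,
        # which is what lets state-space models track long sequences cheaply.
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u = torch.nn.functional.silu(self.in_proj(x))
        decay = torch.sigmoid(self.log_decay)            # (D,)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                       # linear-time recurrence
            state = decay * state + (1 - decay) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1)) + x


class VideoToAudioSketch(nn.Module):
    """Hierarchical fusion: frame-rate video features condition an audio-rate stream."""

    def __init__(self, video_dim=512, audio_dim=256, upsample=4):
        super().__init__()
        self.video_mixer = SSMBlock(video_dim)           # coarse, frame-rate level
        self.to_audio = nn.Linear(video_dim, audio_dim)
        self.upsample = upsample                         # audio frames per video frame
        self.audio_mixer = SSMBlock(audio_dim)           # fine, audio-rate level
        self.head = nn.Linear(audio_dim, 128)            # e.g. mel-spectrogram bins

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        coarse = self.video_mixer(video_feats)           # (B, T, video_dim)
        cond = self.to_audio(coarse)
        fine = cond.repeat_interleave(self.upsample, 1)  # align to audio rate
        return self.head(self.audio_mixer(fine))         # (B, T*upsample, 128)


if __name__ == "__main__":
    feats = torch.randn(1, 30, 512)                      # 30 frames of video features
    print(VideoToAudioSketch()(feats).shape)             # torch.Size([1, 120, 128])
```

The point of the recurrence in SSMBlock is that each step costs constant time and memory regardless of sequence length, which is the property that makes state-space approaches attractive for minutes-long audio where attention-based models become expensive.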
Sony plans to release code, which could bring AI-assisted sound design into video editing tools and creator workflows.