Sony is working on MMHNet, an AI model that generates sound effects and ambient audio directly from video input. The system is designed to keep audio better aligned with visual events than earlier approaches such as MMAudio.
MMHNet uses a multimodal hierarchical architecture with Mamba-based components to maintain coherence over up to five minutes of audio. In examples, it produces synchronized Foley for actions such as balls hitting a table or quick action cuts.
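MMHNet's design has not been published in detail, so the sketch below is only a conceptual illustration of the general idea described above: a hierarchical video-to-audio model that uses a linear-time, Mamba-style state-space mixer to stay coherent over long sequences. Every class name, dimension, and parameter here (VideoToAudioSketch, SSMBlock, hidden sizes, the upsampling factor) is a hypothetical stand-in, not Sony's implementation.

```python
# Conceptual sketch only: all names and sizes below are assumptions for
# illustration; this is not MMHNet's actual architecture.
import torch
import torch.nn as nn


class SSMBlock(nn.Module):
    """Simplified diagonal state-space mixer (stand-in for a Mamba layer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable per-channel decay controls how long information persists,
        # which is what lets state-space models track long sequences cheaply.
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u = torch.nn.functional.silu(self.in_proj(x))
        decay = torch.sigmoid(self.log_decay)            # (D,)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                       # linear-time recurrence
            state = decay * state + (1 - decay) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1)) + x


class VideoToAudioSketch(nn.Module):
    """Hierarchical fusion: frame-rate video features condition an audio-rate stream."""

    def __init__(self, video_dim=512, audio_dim=256, upsample=4):
        super().__init__()
        self.video_mixer = SSMBlock(video_dim)           # coarse, frame-rate level
        self.to_audio = nn.Linear(video_dim, audio_dim)
        self.upsample = upsample                         # audio frames per video frame
        self.audio_mixer = SSMBlock(audio_dim)           # fine, audio-rate level
        self.head = nn.Linear(audio_dim, 128)            # e.g. mel-spectrogram bins

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        coarse = self.video_mixer(video_feats)           # (B, T, video_dim)
        cond = self.to_audio(coarse)
        fine = cond.repeat_interleave(self.upsample, 1)  # align to audio rate
        return self.head(self.audio_mixer(fine))         # (B, T*upsample, 128)


if __name__ == "__main__":
    feats = torch.randn(1, 30, 512)                      # 30 frames of video features
    print(VideoToAudioSketch()(feats).shape)             # torch.Size([1, 120, 128])
```

The point of the recurrence in SSMBlock is that each step costs constant time and memory regardless of sequence length, which is the property that makes state-space approaches attractive for minutes-long audio where attention-based models become expensive.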
Sony plans to release code, which could bring AI-assisted sound design into video editing tools and creator workflows.