AudioX is a new “unified” generative audio model that can produce sound effects and music from text, images, and video, as well as perform advanced tasks like audio inpainting and extension. From a simple prompt, it can synthesize scenes such as “thunder and rain during a sad piano solo” or “a machine gun fires twice followed by silence, then waves,” accurately following timing and event order.
The model also supports text‑to‑music for background tracks, though current musical quality is described as serviceable rather than state‑of‑the‑art. More importantly, AudioX can take video as input and generate synchronized soundtracks that respond to camera cuts and scene changes, demonstrating an understanding of temporal structure in visual content.
On the restoration side, AudioX can fill in missing segments of speech or music (inpainting), or extend an existing clip with a stylistically consistent continuation. Benchmarks show it outperforming competing systems both in breadth of capabilities and in per-modality performance. The weights, totaling under 6 GB, are available in a GitHub repo with instructions for local deployment, making the model accessible to developers running consumer GPUs.