Meta AI Releases WavFlow, a Multimodal Audio Generation Framework That Produces Synchronized High-Fidelity Audio from Video and Text

May 21, 2026



Summary

Meta AI has released WavFlow, a multimodal framework that generates synchronized, high-fidelity audio directly from video and text inputs. It performs on par with established latent-based models on the VGGSound and AudioCaps benchmarks, and its codebase is publicly available on GitHub.

Key Points

WavFlow is a new multimodal audio generation framework from Meta AI that produces synchronized, high-fidelity audio from video and text inputs directly in raw waveform space, bypassing latent compression through waveform patchifying and amplitude lifting.
The system achieves performance on par with established latent-based methods on VGGSound and AudioCaps benchmarks, supporting text-only, video-only, and combined video-plus-text generation modes.
The codebase is publicly available on GitHub under a CC-BY-NC 4.0 license. Production-trained checkpoints cannot currently be released due to organizational policy; a foundation checkpoint trained on open-source data is in development.
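
To make the idea of working "directly in raw waveform space" more concrete, here is a minimal sketch of what waveform patchifying could look like in general: a 1-D signal is cut into fixed-size, non-overlapping patches that a sequence model can treat as tokens. This is an illustration of the generic technique only, not WavFlow's actual implementation. The function names `patchify` and `unpatchify`, the 16 kHz sample rate, and the patch size of 256 are all assumptions chosen for the example, and amplitude lifting is not shown because the article does not describe how it works.

```python
import math


def patchify(waveform, patch_size):
    """Split a 1-D waveform into non-overlapping patches of `patch_size` samples.

    The tail is zero-padded so that the final patch is full length.
    (Hypothetical helper for illustration; not taken from the WavFlow codebase.)
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    pad = (-len(waveform)) % patch_size  # samples needed to fill the last patch
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop the zero padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
wave = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

patches = patchify(wave, patch_size=256)
restored = unpatchify(patches, len(wave))
```

Because the split is a plain reshape plus padding, it is lossless: flattening the patches and trimming the padding recovers the original samples exactly, which is the property that distinguishes this kind of tokenization from a learned latent compression.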
