It was sure to happen, given the increasing success of music generation and the absolutely runaway success of latent diffusion models in general, that someone would apply the same techniques to audio, but even so, AudioLDM’s examples blew me away. AudioLDM is a general audio generator built on a latent diffusion model: it outputs audio waveforms from a text description. Unlike the music generation models we’ve looked at previously, this one is more general, and weirdly that makes it sound better.
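To make the "text description in, waveform out" idea concrete, here is a toy sketch of the stages such a pipeline chains together: encode the text, denoise a random latent toward that conditioning, then decode the latent into audio. The function names and the arithmetic are hypothetical stand-ins of my own, not AudioLDM's actual networks (which are learned models: a text encoder, a denoising network, and a decoder/vocoder).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    """Toy text encoder: a deterministic embedding derived from the prompt."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def denoise_step(z: np.ndarray, cond: np.ndarray, t: int, steps: int) -> np.ndarray:
    """Toy reverse-diffusion step: nudge the latent toward the
    conditioning vector, scaled by the remaining noise level."""
    alpha = t / steps
    return z + 0.1 * alpha * (cond - z)

def decode_latent(z: np.ndarray, upsample: int = 256) -> np.ndarray:
    """Toy decoder/vocoder: upsample the latent into a 'waveform'."""
    return np.repeat(z, upsample)

def generate(prompt: str, latent_dim: int = 16, steps: int = 50) -> np.ndarray:
    cond = encode_text(prompt, latent_dim)   # 1. text -> embedding
    z = rng.standard_normal(latent_dim)      # 2. start from pure noise
    for t in range(steps, 0, -1):            # 3. iterative denoising
        z = denoise_step(z, cond, t, steps)
    return decode_latent(z)                  # 4. latent -> waveform

audio = generate("a dog barking in the rain")
print(audio.shape)
```

The key point the sketch captures is that diffusion happens in a small latent space, not on the raw waveform, which is what keeps models like this cheap enough to train on modest hardware.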
Music is highly structured, so your ears easily catch the artifacts most diffusion models produce; with general audio that becomes harder, and much of what this model generates sounds perfectly correct to my ears. The model itself will be released later, once the authors have sorted out copyright concerns, but even as it stands, it’s awesome. It was even trained on a single GPU, which shows how accessible AI has become. What a time to be alive!
Sources: