I was talking to a good friend of mine about AI models and GPT-3 today. He's an excellent musician, and after I told him that I see AI models largely as tools that'll do away with a lot of the tedious parts of the creative arts and engineering, he asked whether there's a good AI model out there that can do mixing, since that's something he finds takes a while to do.
I've never really looked at music AI before. I knew it was a wide field with a lot of ongoing research, so today I'll just take a brief look at some of the generative models and AI tools that are out there.
What's interesting about music, and something I didn't know before reading the article on Jukebox, OpenAI's music generation AI, is that music is super information dense. A typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million timesteps, all of which need to be generated if you want something that produces raw audio, and the result is prodigiously huge: uncompressed audio for a 4-minute song clocks in at around 42 MB. On top of that, all of those 10 million timesteps need to be correlated with each other, which makes the resulting model rather complex. This is a serious bottleneck for AI-driven direct creation of music; text and images take far fewer timesteps, and the resulting files are usually a lot smaller than 42 MB. As a result, a lot of the work on AI music generation has been symbolic music generation, where you generate MIDI files or notes, which are then translated into music. This has the drawback that a lot of the subtler parts of live music are lost - all the subtle resonances and suchlike.
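To sanity-check those data-size numbers, here's the back-of-the-envelope arithmetic (assuming a stereo recording):

```python
# How much data is in a 4-minute CD-quality song?
SAMPLE_RATE = 44_100   # samples per second (44.1 kHz)
DURATION_S = 4 * 60    # 4 minutes
CHANNELS = 2           # stereo
BYTES_PER_SAMPLE = 2   # 16-bit

timesteps = SAMPLE_RATE * DURATION_S
size_mb = timesteps * CHANNELS * BYTES_PER_SAMPLE / 1_000_000

print(f"timesteps per channel: {timesteps:,}")  # 10,584,000
print(f"uncompressed size: {size_mb:.1f} MB")   # 42.3 MB
```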
In addition, music is difficult to describe, and it tends to be poorly labelled - more so than images, which can be scraped from the internet along with their captions - which makes large-scale music models difficult to train.
This article has an interesting exploration of the intricacies and issues involved, and is well worth reading if you have the time.
Anyway, the very simplified solution to this is to use an autoencoder: compress the raw audio down into a much lower-dimensional latent space, do the generative modelling there, and then map back up to the high-dimensional audio space (Jukebox uses a VQ-VAE for this).
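Here's a minimal sketch of that idea in PyTorch - heavily simplified compared to something like Jukebox's hierarchical VQ-VAE, and the layer sizes and strides are made up purely for illustration:

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Toy 1-D convolutional autoencoder: raw audio -> compact latent -> raw audio.
    Illustrative only; real systems like Jukebox use hierarchical VQ-VAEs."""
    def __init__(self):
        super().__init__()
        # Encoder: each strided conv shrinks the time axis 4x (64x total),
        # so ~10M audio samples become a latent sequence of ~165k timesteps.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=8, stride=4, padding=2),
        )
        # Decoder: the mirror image, mapping the latent sequence back up to audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64, 64, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, audio):         # audio: (batch, 1, num_samples)
        latent = self.encoder(audio)  # a much shorter sequence to model
        return self.decoder(latent)

x = torch.randn(1, 1, 44_100)  # one second of fake audio
model = AudioAutoencoder()
recon = model(x)
print(x.shape, recon.shape)  # output length may differ by a few samples due to padding
```

The generative model then only has to predict the short latent sequence rather than all 10 million raw samples, which is what makes the problem tractable at all.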
Now, as far as real, complete AI music generation goes, Jukebox seems to be pretty much the only system that provides anything close to what you want: give it an artist, genre, and lyrics, and it'll create music. But the music is a bit scratchy. There are services like Soundraw which also provide music, but it's quite generic and not that good. Most of the work in this field still seems to be tied up in research; competitions like the AI Song Contest indicate that there are quite a lot of tools available, but they're mostly Python packages and GitHub repositories, and thus not really accessible to the general public. In addition, the entries use a mix of tools for different parts of a song - GPT-3 for the lyrics, say - rather than one single tool to generate the whole thing. The resulting songs are super nice, though.
For symbolic music generation, MuseNet from OpenAI also seems to be pretty good. It's pretty much your standard transformer architecture with some tweaks.
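The trick with symbolic generation is that a score can be flattened into a sequence of discrete tokens, which a transformer can then model exactly like text. Here's a toy sketch of an event-based encoding - the token vocabulary here is invented for illustration, and MuseNet's actual encoding is more involved:

```python
# Toy event-based tokenization for symbolic music.
# A note is (pitch, start_time, duration), with times in 16th-note steps.
notes = [(60, 0, 4), (64, 0, 4), (67, 2, 2)]  # a C-major-ish fragment

def to_events(notes):
    """Flatten notes into NOTE_ON / NOTE_OFF / TIME_SHIFT tokens."""
    events = []
    for pitch, start, dur in notes:
        events.append((start, f"NOTE_ON_{pitch}"))
        events.append((start + dur, f"NOTE_OFF_{pitch}"))
    events.sort()
    tokens, now = [], 0
    for t, ev in events:
        if t > now:  # emit a time-shift token to advance the clock
            tokens.append(f"TIME_SHIFT_{t - now}")
            now = t
        tokens.append(ev)
    return tokens

print(to_events(notes))
# ['NOTE_ON_60', 'NOTE_ON_64', 'TIME_SHIFT_2', 'NOTE_ON_67',
#  'TIME_SHIFT_2', 'NOTE_OFF_60', 'NOTE_OFF_64', 'NOTE_OFF_67']
```

A transformer trained to predict the next token in sequences like this is, structurally, just a language model - which is why this approach got good so much faster than raw audio generation.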
I was surprised, when I looked into this, that the field doesn't seem as advanced as text-to-image has become; music, surprisingly, seems to be more difficult to generate than images. This might change in the future, of course - the songs in the AI Song Contest sound uncannily human - but for now, musicians probably have the most job security of anyone in the creative arts.
Now for mixing and equalizers. For cleaning up an existing mix, Gullfoss and Soothe seem to be the best options. Soothe is more for balancing out a mix, while Gullfoss is a more complete tool with parameters you can set to determine what type of mix you want. As they're proprietary, however, it's hard to know what models or algorithms they're using. And they're expensive, clocking in at $140 for Soothe and $120 for Gullfoss.
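I have no idea what these plugins actually do internally, but conceptually a resonance suppressor has to find narrow peaks that stick out of the spectrum and turn them down. A crude sketch of the detection half, using standard scipy calls:

```python
import numpy as np
from scipy.signal import welch, find_peaks

def find_resonances(audio, sr=44_100, prominence_db=6.0):
    """Flag narrow spectral peaks that stick out from their neighbourhood.
    A crude stand-in for what tools like Soothe presumably do far more
    cleverly (per-band, time-varying, psychoacoustically weighted)."""
    freqs, psd = welch(audio, fs=sr, nperseg=4096)
    spectrum_db = 10 * np.log10(psd + 1e-12)
    # A peak counts as "resonant" if it rises well above its surroundings.
    peaks, _ = find_peaks(spectrum_db, prominence=prominence_db)
    return [(freqs[i], spectrum_db[i]) for i in peaks]

# Demo: quiet white noise with an artificial resonance injected at 1 kHz.
sr = 44_100
t = np.arange(sr * 2) / sr
audio = np.random.randn(len(t)) * 0.1 + np.sin(2 * np.pi * 1000 * t)
for freq, level in find_resonances(audio, sr):
    print(f"resonance near {freq:.0f} Hz at {level:.1f} dB")
```

The hard part the commercial tools solve, presumably, is doing this continuously over time and deciding how much to cut without making the mix sound dead.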
For complete AI mixing, there's an interesting service from RoexAudio, which performs mixing and mastering in one pass with some simple settings. It looks promising, but probably needs some testing.
Then there's also Neutron, a proprietary plugin from iZotope that seems more professional. This article I found suggests it's a good solution to the problem of mixing.
Looking at all this, it seems there's a definite gap in the market for an AI music orchestrator that pulls together some of these tools for greater ease of use, as well as a need for a lot more training and larger models to generate the audio itself - audio is harder to generate than images. But the rapid growth and improvement in text-to-speech suggests that text-to-music is probably possible too. In 2084, we might see music for every occasion, generated to fit the mood of whatever's going on, plugged into a sentiment analyser that reads what people want to hear. Or a playlist generator that makes an infinite list of songs from the name of the playlist alone.