I’ve been playing around with Riffusion lately. It’s a music generation AI that leverages a text-to-image AI, Stable Diffusion, to generate spectrograms of music, which it then converts to audio and plays. It’s quite good, and something I’d recommend playing around with. Try the pop and rock songs especially: it’s odd to almost understand the words, since the model mostly just captures the underlying vocal melody.
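For the curious, here’s a rough sketch of the decoding step. This isn’t Riffusion’s actual code, just the general idea of turning a spectrogram image back into sound using librosa’s Griffin-Lim phase reconstruction; the dB range, image orientation, and parameter values are my own assumptions.

```python
# A minimal sketch of the spectrogram-to-audio step, assuming a grayscale
# image and a -80..0 dB encoding; this is not Riffusion's actual pipeline.
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

def spectrogram_image_to_audio(image_path, sr=44100, out_path="out.wav"):
    img = Image.open(image_path).convert("L")    # grayscale pixel intensities
    pixels = np.asarray(img, dtype=np.float32)   # rows = frequency bins, columns = time frames
    pixels = np.flipud(pixels)                   # images usually draw low frequencies at the bottom

    # Map 0..255 pixel values to an assumed -80..0 dB range, then back to magnitudes.
    db = (pixels / 255.0) * 80.0 - 80.0
    magnitudes = librosa.db_to_amplitude(db)

    # Griffin-Lim iteratively estimates the phase information the image discarded.
    audio = librosa.griffinlim(magnitudes, n_iter=32)
    sf.write(out_path, audio, sr)
    return audio
```

The key point is that the image is a lossy encoding of the sound: the spectrogram stores magnitudes but not phase, so Griffin-Lim has to guess the phase back, which is likely part of why the output has a slightly smeared quality.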
However, what this shows more than anything is the generalizable nature of these large models. Stable Diffusion was never intended to generate music, but the fact that it can indicates how varied its knowledge is - just like ChatGPT, it’s proving adept at tasks for which it was never trained. This increasingly leads me to believe that we’re essentially very close to AGI - if an image AI model can generate music, why not robot commands? Or anything else describable by a 300x300 grid of 32-bit numbers, which covers a very large amount of things. I’m excited to see what people come up with next with these models. In 2084, we might not even have the idea of specific models for tasks - it’ll all be one big model used for everything, a super mind, limited only by the human imagination.
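To put “a very large amount of things” in perspective, here’s a quick back-of-the-envelope count of how many distinct grids of that shape exist (the 300x300, 32-bit figures come from my claim above, not from any particular model spec):

```python
# Back-of-the-envelope: how many distinct 300x300 grids of 32-bit values exist?
import math

bits_per_cell = 32
cells = 300 * 300
total_bits = bits_per_cell * cells  # 2,880,000 bits of information per grid

# The number of distinct grids is 2**total_bits; express it as a power of ten.
print(f"{total_bits} bits -> about 10^{int(total_bits * math.log10(2))} distinct grids")
```

That works out to roughly 10^866,000 possible grids - more than enough room to encode music, actions, or almost anything else you can flatten into numbers.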