2084: Making Stable Diffusion Smaller & Better Pitches Today!
Some thoughts on two projects I'm working on.
There are a lot of really cool models out in the world today. My favorite, and the one I use for making the images in this blog, is Midjourney. Their new version 4 just blows me away with how good it is. There's also Stable Diffusion, GPT-J, and a bunch of other absurdly impressive models. But the commonality between all these useful models is just how massive they are, and correspondingly, how long they take to perform inference, which makes them suboptimal for most use cases and slow to use - kinda like how computers used to be.
Now you could wait for Moore's law to make this irrelevant, but in the meantime there are a lot of methods available to make models "smaller" - or rather, to "make the weight matrix more sparse", which isn't quite the same thing. To explain simply, a neural network is usually modeled in code as a series of matrix multiplications with a nonlinear step between layers. This makes for easy implementation, but because of how matrix multiplication works, storing the weights as a plain dense matrix means that "neurons" with zero weights still get included in every multiply step. Therefore a lot of the techniques that perform so-called "unstructured pruning", where some weights are zeroed out, don't have a significant impact on inference time out of the box - you need to replace the matrix multiply with an alternative that ignores zero weights. This is what I'm working on with my sparse neural network accelerator, something that takes these sparse weight matrices and performs speedy inference.
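To make that concrete, here's a minimal sketch (NumPy/SciPy, with made-up dimensions and a stand-in sparsity level) of why zeroing out weights doesn't speed up a dense multiply, and how a sparse format like CSR skips the zeros:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A "pruned" weight matrix: 95% of entries zeroed out.
dense_weights = rng.standard_normal((4096, 4096))
dense_weights[rng.random((4096, 4096)) < 0.95] = 0.0

x = rng.standard_normal(4096)

# Dense multiply: every entry, zero or not, is touched.
y_dense = dense_weights @ x

# CSR stores only the ~5% nonzero entries, so the multiply
# does proportionally less work.
sparse_weights = csr_matrix(dense_weights)
y_sparse = sparse_weights @ x

assert np.allclose(y_dense, y_sparse)
```

A hardware accelerator gets the same kind of win by skipping the zero entries inside the multiply-accumulate loop itself, which is roughly the idea behind the accelerator project above.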
Anyhow, the biggest technique used today is something called learning rate rewinding, described in this paper. It's pretty good, and very useful for most pruning. However, the lab mine is partnered with, Lawrence Livermore National Laboratory, developed a completely different technique which is super cool. They proved that given a sufficiently overparameterized neural network (like Stable Diffusion) with randomly assigned weights, you can find a sparse subnetwork that achieves accuracy comparable to the original trained network, using an algorithm they call biprop. They also tested ways of combining learning rate rewinding with their methods to find accurate subnetworks in randomly initialized networks, in this paper.
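To give a rough flavor of learning rate rewinding (a toy sketch, not the paper's exact recipe - the model, loss, pruning fractions, and schedule here are all stand-ins): after each round of magnitude pruning, you retrain with the learning rate schedule rewound to its original, early-training values, rather than fine-tuning at the final decayed rate.

```python
import torch
import torch.nn as nn

# Toy model standing in for a much bigger network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

def magnitude_prune(model, fraction):
    """Zero out the smallest-magnitude remaining weights, returning the masks."""
    weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(weights[weights > 0], fraction)
    masks = []
    for p in model.parameters():
        mask = (p.detach().abs() >= threshold).float()
        p.data *= mask
        masks.append(mask)
    return masks

def retrain(model, masks, lr, steps=100):
    """Retrain with a rewound LR schedule, keeping pruned weights at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        loss = model(torch.randn(32, 64)).pow(2).mean()  # stand-in loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
        for p, m in zip(model.parameters(), masks):
            p.data *= m  # re-apply masks so pruned weights stay zero

# Iterative magnitude pruning with learning rate rewinding: each round
# prunes 20% of the remaining weights, then restarts training from the
# original (high) learning rate instead of the final decayed one.
for _ in range(3):
    masks = magnitude_prune(model, fraction=0.2)
    retrain(model, masks, lr=0.1)  # schedule rewound each round
```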
Now of course the issue is that, whether you use biprop or LRR, it's still slow to prune a massive model like Stable Diffusion. And unfortunately, for the moment, I don't have access to my own personal supercomputer. So I thought of an alternate way to use these techniques on something like Stable Diffusion: run the model over a dataset once, recording the intermediate activations at each layer, then take a subset of layers and use those recorded activations as training data for biprop or learning rate rewinding. The idea is that if a pruned series of layers (call it a subnetwork) gives the same, or nearly the same, output as the original subnetwork on the same input, then substituting the pruned subnetwork for the original should leave the model's overall output approximately unchanged. The advantage of this technique is that training the subnetwork only requires performing inference with the subnetwork itself - the rest of the model can be ignored, which should drastically cut training time. Given that our lab has achieved sparsity levels of 95% on ResNet-18, this should let us cut the parameter count of the larger Stable Diffusion layers by a similar amount, and therefore get much quicker overall inference. This week I plan to run some experiments using this technique and see whether the idea works in real life - and of course post the data on the blog.
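Here's a minimal PyTorch sketch of the recording-and-retraining loop I have in mind. Everything here is a stand-in: the "block" is a toy stack of layers, the data is random, and for simplicity the pruned replacement is just a smaller network trained with an MSE loss, where the real experiment would use biprop or LRR:

```python
import torch
import torch.nn as nn

# Stand-in for one block of a big model (e.g. a group of Stable Diffusion layers).
original_block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

# 1) Run the model once over a dataset, recording the block's inputs and outputs.
recorded = []
def hook(module, inputs, output):
    recorded.append((inputs[0].detach(), output.detach()))
handle = original_block.register_forward_hook(hook)

for _ in range(100):                       # stand-in for a real dataset pass
    original_block(torch.randn(8, 256))
handle.remove()

# 2) Train a pruned/smaller replacement to reproduce those activations,
#    touching only this block rather than the whole model.
pruned_block = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 256))
opt = torch.optim.Adam(pruned_block.parameters(), lr=1e-3)
for _ in range(10):
    for x, y in recorded:
        loss = (pruned_block(x) - y).pow(2).mean()  # match the original outputs
        opt.zero_grad()
        loss.backward()
        opt.step()

# 3) Swap pruned_block in for original_block; if its outputs closely match,
#    the full model's outputs should be approximately unchanged.
```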
Speaking of ideas working in real life, last time I talked about the issue of datasets in models. The issue being that a lot of models are too general, and their datasets too general, to solve the specific problems that crop up in business - it's nice to know whether a sentence is positive or negative, but what does that mean for a business? I feel like as the field evolves, there'll have to be a whole rethinking of datasets and the structures built around them.
And so I thought that an interesting application, taking inspiration from this, could be a modification of sentiment analysis on speech to perform the following task: mapping from pitches given to investors to money invested, or simply whether they would invest or not. There are a bunch of really smart models in the field of speech sentiment analysis today, but the issue is that they all tend to be generic - they only give you outputs like "what emotion is the speech expressing: happiness, sadness, anger, etc.", which is interesting from an academic standpoint, but not so much from a business standpoint. There was an article on something similar before, but it used human judges, not AI models. I think there's definitely a space for a model with a narrower focus. The only issue will be getting data. I was thinking of cutting up pitch competition videos from YouTube, like the MIT $100K, but that's something I'll have to research, or potentially reach out to people about. This is another area where I'll be doing some experiments over the coming weeks, which I'll post here.
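As a sketch of what such a narrow model might look like, assuming a dataset of pitch audio clips labeled with funding outcomes exists (the checkpoint name is a real pretrained speech encoder, but the dataset, head, and training loop here are hypothetical stand-ins):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pretrained speech encoder used as a frozen feature extractor.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Narrow binary head: "would investors fund this pitch?"
head = nn.Linear(encoder.config.hidden_size, 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def embed(waveform):
    """Mean-pool the encoder's hidden states into one vector per clip."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1)

# Stand-in data: in reality, audio clips cut from pitch-competition
# recordings, labeled with whether the pitch got funded.
dataset = [(torch.randn(16_000).numpy(), 1.0),
           (torch.randn(16_000).numpy(), 0.0)]

for waveform, funded in dataset:
    logits = head(embed(waveform))
    loss = loss_fn(logits.squeeze(), torch.tensor(funded))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design choice here is to keep the big generic encoder frozen and only train a tiny task-specific head, which fits the "narrow model on scarce data" situation: the hard part remains collecting and labeling the pitch clips.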
In 2084, of course, no business will be without its "financial advisor robot" for performing trust and reliability verifications of people based on how they speak and act, and no VC will be without its "investment bot" that'll try to predict whether or not a pitch will be successful. It'll be an interesting environment to raise money in, one that could have a lot more feedback than the fairly opaque environment that dominates today.