2084: Making Models Faster
A brief update on my convolutional neural network accelerator project
So today my lab partner Mattia and I got together to discuss what the architecture for our sparse convolutional neural network accelerator is going to look like. It involved a lot of drinking. Coffee. But we finally decided on an “input-stable” approach to convolution, where we keep the input resident and loop over the input channels and, for each one, over its kernel tuples, with some tile-flexibility.
Now, to explain what this means, I first need to cover how a 2D convolution works. It looks like the following animation:
This shows a single input channel and a partial sum for one output channel. To generate a full output channel, you loop over all the input channels, performing the same operation with the relevant kernel at each step, and sum everything together at the end. What’s useful about a convolution is that it lets a neural network make decisions at a point using information from the points around it, by folding the surrounding points into that point’s value. Think of it like a filter that extracts the useful parts of an image. The following shows a bird that was passed through various convolutional “layers”. You can see how some parts of the bird were highlighted and others were not. As you can imagine, this highlighting makes it easier for a neural network to figure out what it is looking at.
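If you’d rather see it as code than an animation, here’s a minimal NumPy sketch of a multi-channel 2D convolution. The shapes and names are illustrative, not our actual layer sizes:

```python
import numpy as np

def conv2d(inputs, kernels):
    """inputs:  (C_in, H, W)        dense input channels
       kernels: (C_out, C_in, K, K) dense kernel weights
       returns: (C_out, H-K+1, W-K+1), no padding, stride 1."""
    c_in, h, w = inputs.shape
    c_out, _, k, _ = kernels.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for o in range(c_out):                  # each output channel...
        for c in range(c_in):               # ...sums a contribution from every input channel
            for y in range(h - k + 1):
                for x in range(w - k + 1):
                    # slide the K x K kernel over the input and accumulate
                    out[o, y, x] += np.sum(inputs[c, y:y + k, x:x + k] * kernels[o, c])
    return out
```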
But the number of input and output channels can get pretty big, so evaluation can be slow. To attack this problem, the team at LLNL has managed to make the convolutional layers of their models “sparse”: they’ve removed as many elements from the kernel matrices as possible, until less than 5% of the original weights remain at each layer. When only 5% of the kernel elements are left, there’s no point in storing the kernels as a bunch of small 3x3 (or whatever) matrices, since they’d be mostly blank. Instead, you’d rather store them as a list of tuples, in the format (output, input, i, j, val).
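For the curious, here’s roughly what building that tuple list from an already-pruned dense kernel tensor could look like; the function name and shapes are mine, not LLNL’s:

```python
import numpy as np

def to_sparse_tuples(kernels):
    """kernels: (C_out, C_in, K, K) dense weights, mostly zero after pruning.
       Returns a list of (output, input, i, j, val) for the surviving entries."""
    tuples = []
    c_out, c_in, k, _ = kernels.shape
    for o in range(c_out):
        for c in range(c_in):
            for i in range(k):
                for j in range(k):
                    v = kernels[o, c, i, j]
                    if v != 0.0:  # keep only the ~5% of weights that survived pruning
                        tuples.append((o, c, i, j, float(v)))
    return tuples
```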
But now the question facing Mattia and me was: how will we compute the convolution of this list of tuples with a dense stack of input matrices? Or rather, how will we build the architecture to do it?
What we decided in the end was that we would create a big matrix of multipliers. We would sort the kernel tuples by input channel. Then we’d loop through the input channels, and for each one, loop through the kernel tuples belonging to it. For each tuple, we’d shift the input by (i, j), compute a pointwise multiply of the shifted input by val, and accumulate the resulting partial sum into off-chip memory, in the region belonging to that tuple’s output channel, adding it to the result of previous calculations. After this is done, we’d have calculated the convolution for every output channel.
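In software terms, the dataflow we sketched looks roughly like this. It’s a functional model only, with made-up names, not the hardware itself:

```python
import numpy as np
from collections import defaultdict

def sparse_conv_input_stationary(inputs, tuples, k):
    """inputs: (C_in, H, W); tuples: list of (out_ch, in_ch, i, j, val); k: kernel size."""
    c_in, h, w = inputs.shape
    out_h, out_w = h - k + 1, w - k + 1
    # group the tuples by input channel, mirroring the sort-by-input-channel step
    by_input = defaultdict(list)
    for o, c, i, j, v in tuples:
        by_input[c].append((o, i, j, v))
    # one partial-sum matrix per output channel ("the correct memory area" for each)
    partial_sums = defaultdict(lambda: np.zeros((out_h, out_w)))
    for c in range(c_in):                          # keep one input channel resident at a time
        x = inputs[c]
        for o, i, j, v in by_input[c]:             # every surviving weight that touches it
            shifted = x[i:i + out_h, j:j + out_w]  # "shift the input by i, j"
            partial_sums[o] += v * shifted         # pointwise multiply, then accumulate
    return partial_sums
```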
Of course, we’d need to chunk the input: feed it through the matrix of multipliers, called KMACMAT, piece by piece, calculating and accumulating the output partial sum piece by piece. But after we’re done, we’d have calculated all the output channels. The reason for doing it this way is that a new kernel is quicker to load than a new input, and since most of the kernel weights are zero, a lot of input channels get skipped over anyway, but not many of the output channels.
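To give a feel for the chunking, here’s a rough sketch of streaming one shifted, scaled input channel through the array tile by tile. The 16x16 tile size is an assumption for illustration, and the real KMACMAT scheduling will differ:

```python
import numpy as np

TILE = 16  # assumed multiplier-array dimension, purely illustrative

def tiled_shift_multiply(x, i, j, v, out_h, out_w):
    """Compute v * x[i:i+out_h, j:j+out_w] one TILE x TILE block at a time,
       the way the hardware would, and stitch the partial sum back together."""
    partial = np.zeros((out_h, out_w))
    for ty in range(0, out_h, TILE):
        for tx in range(0, out_w, TILE):
            y_end, x_end = min(ty + TILE, out_h), min(tx + TILE, out_w)
            # each block is one pass of data through the multiplier array
            partial[ty:y_end, tx:x_end] = v * x[i + ty:i + y_end, j + tx:j + x_end]
    return partial
```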
Now, this calculates the convolution, but convolutional neural networks also have max pool and fully connected layers. To support those as well, we decided to extend the multipliers into multiplier-accumulators, with one external control bit deciding whether the accumulator is zeroed and another controlling which cardinal direction the output streams to. This is called a “tile-flexible” paradigm, and it helps support a large variety of different operations, although we’re planning to keep it rather simple.
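A toy behavioral model of one of these cells might look like the following. The two control bits and the direction encoding are assumptions for illustration, since the real cell will be specified in ACT:

```python
class MACCell:
    """Toy model of one tile-flexible multiplier-accumulator cell."""

    def __init__(self):
        self.acc = 0.0

    def step(self, a, b, zero_bit, direction_bit):
        """Multiply-accumulate one operand pair.
           zero_bit=1 clears the accumulator first (e.g. at the start of a new output);
           direction_bit selects which neighbour the value streams to (assumed encoding)."""
        if zero_bit:
            self.acc = 0.0
        self.acc += a * b
        return ("east" if direction_bit else "south", self.acc)
```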
Of course, we’re still working on the architecture, but this is the basic idea at the moment. Now all that remains is to use our lab’s toolset, ACT, to write out the Dataflow (a sublanguage in the toolset) for the design, and then debug it for hours, since hardware is quite tricky to debug. Especially since there isn’t an easy way to load large chunks of data into a simulated chip in the language, so I’ll have to write a data loader for it. Still, it’s a relatively simple yet really powerful toolset, and it makes it quite easy to design asynchronous systems.
There’s so much work being done on these sparse neural network accelerators that they’ll probably be everywhere very soon. In combination with ever cheaper silicon and smaller models, you’ll soon see a $20 chip that can run the big models, and at that point, all bets are off as to how far AI will reach. It’ll be like Microsoft’s “a computer on every desk”, but in the year 2084.