About the talk
Most deep learning networks today rely on dense representations. This stands in stark contrast to our brains, which are extremely sparse, both in connectivity and in activations. Implemented correctly, the potential performance benefits of sparsity in weights and activations are massive. Unfortunately, the benefits observed to date have been extremely limited. It is challenging to optimize training to achieve highly sparse and accurate networks. Hyperparameters and best practices that work for dense networks do not apply to sparse networks. In addition, it is difficult to implement sparse networks on hardware platforms designed for dense computations. In this talk we present novel sparse networks that achieve high accuracy and leverage sparsity to run 100X faster than their dense counterparts. We discuss the hyperparameter optimization strategies used to achieve high accuracy, and describe the hardware techniques developed to achieve this speedup. Our results show that a careful evaluation of the training process combined with an optimized architecture can dramatically scale deep learning networks in the future.
Subutai Ahmad is the VP of Research at Numenta, a research company that is applying neuroscience principles to machine intelligence research. Subutai brings experience in deep learning, neuroscience, and real-time systems. His research interests lie in understanding and applying neuroscience insights from areas such as sparsity, dendrites, unsupervised learning, and sensorimotor prediction. Subutai holds a B.S. in Computer Science from Cornell University, and a Ph.D. in Computer Science and Computational Neuroscience from the University of Illinois at Urbana-Champaign.
Hi everyone. My name is Subutai Ahmad. I'm the VP of Research at Numenta. I'd like to start by thanking the folks at SigOpt for the opportunity to present at the summit today. My talk is going to focus on probably the single biggest problem facing deep learning and AI systems today. In particular, I want to talk about the enormous computing requirements imposed by today's deep learning networks, and one possible solution, namely introducing sparsity into those networks.

For context, we're called Numenta. We're a small research lab located in the San Francisco Bay Area. We were founded back in 2005 to look at neuroscience-inspired approaches to learning, and we have a two-part mission. The first part is a scientific mission: we study the neocortex, and we try to take the experimental literature and create functionally accurate theories of the brain, biological theories that we publish in the neuroscience literature. The second part of our mission is to take what we've learned from the neuroscience, take the neocortical principles we identify, and apply them to AI. In particular, we're looking to see how we can improve current AI techniques.

So what is the problem we're looking at today? As many of you know, there has been an exponential increase in the resources required by AI systems over the last few years. Today, GPUs are the workhorses of AI. The best GPUs can do hundreds of trillions of arithmetic operations per second, but even that's not enough to keep up with the tremendous demands of AI. It's startling: compute requirements are doubling every few months, far faster than Moore's law. In 2018, BERT, a state-of-the-art Transformer network, cost about $6,000 to train. Last year, GPT-3, the evolution of Transformers, cost over ten million dollars to train. Along with that, the carbon footprint is also exploding; the amount of energy required to train even a single instance of one of these networks could power a small town or village. The chart on the right shows that hardware innovations just cannot keep pace. The red curve shows the compute requirements imposed by deep learning over the last few years, and the black curve shows the pace of hardware innovation, the acceleration of hardware performance. As you can see, there is a big disconnect. This is a massive problem with AI today.

So we look to the neuroscience, and there's one aspect of the neocortex that's particularly relevant to this problem: sparsity. What do I mean by that? This square on the slide shows recordings of neurons that are active while an animal is performing a task. The black square is actually filled with neurons; you can't see them, but whenever a white dot flashes, that's a neuron becoming active. As you can see, only a tiny percentage of the neurons are active at any moment. Your brain is behaving exactly like this, with incredibly sparse firing, as you're listening to my talk.

In a bit more detail, there are two aspects of sparsity that I'm going to focus on today. The first is activation sparsity. We know from neuroscience that only about 0.5% to 2% of neurons are active at any point in time. This is an incredibly small number, far smaller than the number of units active in a deep learning system. In addition to activation sparsity, there's also connectivity sparsity. When you look at groups of neurons that project to one another, the percentage that are actually connected is incredibly sparse: only about one to five percent of the possible connections actually exist. Now, there's a lot more to sparsity in the brain, many many dimensions of it, and I'm not going to go into all of that in this presentation. But if you look at the last bullet, you'll see that all of this sparsity leads to incredibly efficient usage of energy: our brain only uses about 20 to 30 watts of power, less than a light bulb. The question is, can we create deep learning systems that have some of these properties, and in the process address the scaling issue I raised earlier?

So what do we do? We create really sparse networks. On the right is a typical kind of deep network that we create. As you can see, the connectivity matrix between layers is very sparse compared to a standard deep learning network, which is shown on the left. In addition, the activations of the neurons are also sparse, just like in the brain. Today I'm not going to go into the details of the algorithms for how we create these networks; the paper I reference there from 2019 goes into that. But I'll talk about one aspect. As I mentioned, there are both sparse activations and sparse weights, and if you look at that from a scaling standpoint, there are some really interesting properties. On the left I'm showing a tiny neural network. We have sparse activations coming in and a sparse weight matrix, and computing the output means computing this matrix product. It turns out that for any zero in the activations, you can mathematically skip the corresponding row of the weight matrix, and for any zero in the weight matrix, you can skip that product as well. So there's a two-sided benefit: zeros in either the activations or the weights lead to products that can be completely skipped.

This chart outlines the potential benefits, and the key point is that the benefit is multiplicative. Consider weight sparsity first: suppose you can train networks that are 90% sparse, so only one out of ten weights is nonzero. You can imagine skipping about 90% of the computation; you only need to do one tenth of the products. However, if you can also get 90% activation sparsity, you only need to compute 1% of the products, because only 1% will have nonzeros on both sides. And if you can get anywhere close to what we have in the brain, with 95% to 98% weight sparsity and activation sparsity, you can see there's a massive opportunity in exploiting the combination of the two.

So how do we do this? It turns out it's not as easy as I just made it sound. Sparse models are very challenging to accelerate on hardware, and very challenging to create. We know that dense networks are handled very efficiently by GPUs and CPUs. Sparse networks, on the other hand, are quite challenging to implement on hardware. In order to take advantage of the math I mentioned earlier, sparse operations need to efficiently skip the elements that are zero, and on many architectures the overhead of just figuring out where the zeros are eats up the computational savings. We've been looking at a hardware architecture based on FPGAs, which allow you to create custom, application-specific circuits, and FPGAs are pretty well suited to the irregular computations that are required.

Now, that's just the hardware side. From an algorithm standpoint, the training side is also quite challenging. In particular, it's really hard to train networks that are both sparse and accurate. Some of the simple techniques in the literature don't always scale well to large datasets, and it turns out that the optimal hyperparameters you need for training these networks are quite different from what you'd typically use for a dense network.

So let's turn to the results. The first dataset I want to talk about is the Google Speech Commands dataset. This is a dataset of spoken one-word commands, spoken by thousands of individuals, that was released by Google a few years ago. State-of-the-art accuracy on this dataset is between 95% and 97.5% across more than ten categories. The table here shows a dense convolutional network that was trained on this dataset, as well as a sparse one. As you can see, we were able to train sparse convolutional networks that have essentially the same accuracy as the dense convolutional network. The number of nonzero weights in the sparse network is about one tenth the number of weights in the dense network, so we're able to get to over 90% sparsity while maintaining accuracy.

I mentioned that the hyperparameters are quite different, so how were we able to figure out what they should be? What we did is use the multimetric feature of SigOpt. In our case, we needed to balance two different metrics: we wanted high accuracy as well as high sparsity. In the multimetric setup for this experiment, we had a separate threshold for each of the two metrics. I'm showing a screenshot of the SigOpt dashboard on the right: we set a minimum accuracy threshold of 96.5%, and a density threshold of 0.15, so we wanted the network to be at least 85% sparse. In our case there were ten hyperparameters we were searching over, and we ran about a thousand different experiments, a thousand trials. The parallel coordinates chart shown here, another snapshot from the dashboard, uncovered some really interesting patterns among the hyperparameters. For some hyperparameters there are definite clusters and regions that work far better than others, and we were able to gain insights from this and use them for a lot of our training. As you can also see from this screen, the set of trials that actually met both of our criteria was a tiny subset, a tiny sliver within this large ten-dimensional hyperparameter space.
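To make that two-threshold criterion concrete, here is a minimal sketch in plain Python (this is an illustration of the feasibility logic, not the SigOpt API; the threshold values are the ones from the dashboard, while the trial results are hypothetical):

```python
# Sketch of the two-metric feasibility check used in the search.
# A trial is feasible only if it clears BOTH thresholds at once:
# accuracy >= 96.5% and density <= 0.15 (i.e., at least 85% sparse).

ACCURACY_MIN = 0.965   # minimum acceptable accuracy
DENSITY_MAX = 0.15     # maximum fraction of nonzero weights

def is_feasible(trial):
    """Return True if a trial meets both metric thresholds."""
    return trial["accuracy"] >= ACCURACY_MIN and trial["density"] <= DENSITY_MAX

# Hypothetical trial results for illustration:
trials = [
    {"accuracy": 0.971, "density": 0.10},   # meets both thresholds
    {"accuracy": 0.958, "density": 0.08},   # sparse enough, not accurate enough
    {"accuracy": 0.969, "density": 0.22},   # accurate enough, too dense
]
feasible = [t for t in trials if is_feasible(t)]
print(len(feasible))   # -> 1
```

The point of joint thresholds rather than a single combined score is that a trial which maximizes one metric while missing the other is simply rejected, which is why the feasible region ends up so small.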
Of the thousand trials, only four actually met our criteria, and SigOpt was able to locate those trials in this pretty small region of the hyperparameter space. The Bayesian optimization techniques actually worked really well for us here, and that is how we got to this level of accuracy and sparsity.

We did a few more experiments with different variations. We wanted to see if smaller dense networks would also work well, and as you can see here, if we try to reduce the size of the overall dense network, accuracy starts to drop. Interestingly, we were actually able to create an even sparser network by increasing the size of the sparse network. This is counterintuitive, but it turns out that if you increase the number of neurons in your overall network, you can actually decrease the number of nonzero weights that are required; there are some really interesting mathematical properties that come into play. We were ultimately able to create a network that's about 96% sparse here.

So how does this lead to scalability and performance improvements? We looked at how we might accelerate these networks. In this case we're looking at a network that's about 95% sparse. We tried a few CPU-based techniques, as shown here, and the best ones got about a 3x performance improvement on CPUs. That's not bad, but it's a far cry from the roughly 20x you might imagine from the weight sparsity alone: with 95% sparsity, only about one in twenty weights is nonzero, so a 3x improvement is nowhere near what you'd hope for. GPUs are typically limited to about a 50% improvement, or a 2x improvement at most, which is again much smaller than what you'd expect. And this doesn't even take activation sparsity into account.

So what we did is implement a technique we call complementary sparsity on FPGAs to recover some of these performance gains. The idea is to compress the sparse weights into a structure so that you can use dense operations on them. What I'm showing here is a cartoon illustration. On the left we have weight kernels that are about 80 percent sparse. You can take five of these, overlay them one atop the other, and create a single dense matrix. At that point you can do a dense multiplication with your activation vector, and then, using the FPGA, you can create individual adder trees that do the appropriate summations to get the five different results you want out. The great thing about this technique is that the speedup scales linearly with the degree of sparsity: if I have 90% sparsity, I can overlay ten of these weight matrices together and get about a 10x improvement overall. However, it does require that the kernels not overlap with one another, which places additional constraints on our sparse networks. Through the proper hyperparameter searches, we were actually able to recover the accuracy even with these constraints. There's another bonus to doing this: we can also apply sparse activations. Since we've converted the weight matrices into specially packed dense weight matrices, we can now look at the sparse activations as they come in. We know the indices of the nonzero activations on the FPGA, and we can use that to pull up only the weight values that are actually required. This is what really allows us to unlock the power of these sparse-sparse computations.

Okay, so how did this work on FPGAs? We tested the performance of these networks on two different Xilinx chips. The Alveo U250 is a large platform designed for data center applications; it's got over 1.7 million logic cells and quite a bit of memory on the card. On the right-hand side, the Zynq UltraScale+ ZU3EG is the opposite: a tiny chip designed for edge situations. As you can see, the number of logic cells and so on are much smaller there.
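Before getting to the numbers, the complementary-sparsity packing I described a moment ago can be sketched in a few lines of Python. This is a toy CPU illustration under my own naming (`pack`, `sparse_sparse_matvec` are hypothetical helpers, and I use two kernels rather than five for brevity), not our FPGA implementation: kernels with non-overlapping supports are overlaid into one dense row, a single pass runs over only the nonzero activations, and an "owner" map plays the routing role that the adder trees play in hardware.

```python
# Toy illustration of complementary sparsity: several sparse weight
# kernels whose nonzero positions do not overlap are packed into one
# dense vector, multiplied once, and the per-kernel sums recovered.

def pack(kernels):
    """Overlay sparse kernels with disjoint supports into one dense row.

    Returns the packed values plus an 'owner' map telling which kernel
    each position belongs to (the adder trees use this routing on chip).
    """
    n = len(kernels[0])
    packed = [0.0] * n
    owner = [-1] * n
    for k, kern in enumerate(kernels):
        for j, w in enumerate(kern):
            if w != 0.0:
                assert owner[j] == -1, "kernel supports must not overlap"
                packed[j] = w
                owner[j] = k
    return packed, owner

def sparse_sparse_matvec(packed, owner, num_kernels, x):
    """One pass over only the nonzero activations of x."""
    out = [0.0] * num_kernels
    for j, a in enumerate(x):
        if a != 0.0 and owner[j] != -1:   # skip zero activations / empty slots
            out[owner[j]] += packed[j] * a
    return out

# Two 80%-sparse kernels with complementary (non-overlapping) supports:
k0 = [2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
k1 = [0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0]
x  = [1.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # sparse activations

packed, owner = pack([k0, k1])
print(sparse_sparse_matvec(packed, owner, 2, x))  # -> [4.0, 3.0]
```

The assertion inside `pack` is exactly the non-overlap constraint mentioned above: if two kernels claim the same position, the packing is invalid, which is why this constraint has to be imposed during training.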
It's a much smaller system, really for embedded IoT types of applications. In the results I'll show you, there are three different types of networks that we implemented. One was a pure dense network, and we used Xilinx's optimized deep learning tool set to run it. And we had two different flavors of sparse networks: one is what we call sparse-dense, which just leverages sparse weights without exploiting the sparsity of the activations, and then we implemented a sparse-sparse network, which tries to take advantage of both sparse weights and sparse activations.
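To see why these three variants should differ, the multiplicative argument from earlier gives a back-of-the-envelope ideal multiply count for each. The densities below are illustrative (95% weight sparsity as discussed, plus an assumed 90% activation sparsity), not measured FPGA numbers:

```python
# Ideal multiply counts per layer for the three variants, using
# illustrative densities for a 1000 x 1000 fully connected layer.

n_in, n_out = 1000, 1000
weight_density = 0.05       # 95% weight sparsity
activation_density = 0.10   # 90% activation sparsity (assumed)

dense = n_in * n_out                                         # every product computed
sparse_dense = dense * weight_density                        # skip zero weights only
sparse_sparse = dense * weight_density * activation_density  # skip both sides

print(dense, sparse_dense, sparse_sparse)           # -> 1000000 50000.0 5000.0
print(dense / sparse_dense, dense / sparse_sparse)  # -> 20.0 200.0 (ideal speedups)
```

In practice, memory traffic and bookkeeping overheads eat into these ideal ratios, which is why the measured speedups that follow are closer to 11x and 33x than to the 20x and 200x upper bounds.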
The results were quite encouraging. Here I'm showing the performance of a single network on each platform. On the Alveo U250, the dense network could process about 3,000 words per second. The sparse-dense network processed about 35,000 words per second, about eleven times faster. The sparse-sparse network was over 33 times faster than the dense network from a throughput standpoint, so we got an additional 3x benefit from adding sparse activations into the mix.

What's really interesting is when you look at the small embedded chip, the ZU3EG. The dense network doesn't even fit on that platform, because there's just not enough memory. However, we can fit the sparse networks on there, because there are many fewer weights, and you can see that the sparse networks on it are significantly faster even than the single dense network on the Alveo U250. I'm calling this an infinite speedup, because the dense network won't even fit on that small embedded chip. That was a single network. One of the nice benefits of this technique is that, because the networks are smaller, you can actually fit more of them on the chip. So if you look at the full chip, we can fit four instances of the dense network on the Alveo, and that leads to a full-chip throughput of about 12,000 words per second. With the sparse-sparse networks, we were able to get over a hundred times the performance, about 112 times the throughput of the dense network, when you look at full-chip performance. Again, with the embedded system we were only able to fit one copy of the sparse networks, so those numbers are the same, but it's interesting that even a single sparse network on the embedded chip is about three to four times faster than four of these dense networks running on the data-center-class chip. What this shows is that with sparsity, you not only get performance benefits, you also open up new applications that you couldn't address before.

You get corresponding benefits in energy efficiency as well. When you're using sparse-sparse networks, you can get over a hundred times better energy usage, that is, a two orders of magnitude reduction in energy usage for this network. This is one of those rare cases where an increase in performance does not lead to an increase in energy usage, because the performance increase we're getting comes from algorithmic changes, not from improvements in the core hardware itself. So it's a win-win situation from that standpoint.

The question is, does it scale? The GSC network is relatively small by today's standards, about 1.7 million parameters, and we wanted to see if we could show some of these benefits on much larger networks, such as networks trained on ImageNet and some of the larger Transformer networks. We don't have hardware results to show for these yet, but we have been focusing on the training aspects. As I mentioned, it's difficult to train sparse networks to be accurate, and the challenges are even more evident with such large networks. We need to do a pretty thorough hyperparameter search, but the costs increase dramatically when you get to these large networks, and in addition, some of the intuitions you build when training smaller networks don't always apply to larger ones. One common strategy is to optimize your hyperparameters using a small network and then use those hyperparameters to train the large network. It's cost-effective, because you're mostly training small networks, but unfortunately it does not always work, and in particular with sparse networks we found this strategy does not work: the parameters you get from tuning a small network do not always translate to a large network.

So what we did is use SigOpt's multitask optimization feature. What you can do there is set up a number of different tasks, associate a cost with each task, and SigOpt will try to optimize which tasks are run, with the goal of reducing the overall cost of running the search. Our strategy was to always train the large network, but with a varying number of training steps. You can see on the right another screenshot of the dashboard: we set up a three-task system, with the cheaper tasks training the network for a fraction of the full schedule, the cheapest at a cost of 0.25. By balancing between these different tasks, we were able to get state-of-the-art accuracy in a cost-effective manner. And again, what we found is that the parameters for the sparse network were quite different from those for the dense network, so this was definitely a worthwhile thing to do. This chart shows how the optimization proceeded: the gray dots show the runs at the lower-cost tasks, and the blue circles show the trials that were run at the higher cost. You can see that most of the runs were done in the lower-cost scenarios, and then, once it started to hone in on a parameter regime, SigOpt automatically invoked the higher-cost task. This allowed us to achieve state-of-the-art accuracy on these networks in a very cost-effective manner.

This somewhat busy table shows the full spectrum of sparse networks that we've been training so far. The top two rows show the GSC network that I showed earlier. The middle two rows show results on ImageNet using ResNet-50: we were able to get 77.1% top-1 accuracy using 75% sparse weights. That's about one quarter the number of parameters of the full network, and these are state-of-the-art results for this level of sparsity. The 75% number was actually imposed by the particular hardware implementation we were looking at; it's possible to go even higher and still maintain that accuracy. The bottom three rows show our results with these much larger Transformer networks, and if you look at the bottom row, we were able to get Transformers that are 90% sparse, with one tenth the number of weights of the full network, with very similar accuracies. The accuracy on the GLUE benchmark for the dense network in this case was about 76.6%, and we were able to get to 76.3% accuracy with 90% sparsity, which is roughly in the range of noise. So where we do see a small accuracy hit, it's not that significant. As I mentioned, we have hardware implementations of these in progress, and we fully believe the complementary sparsity techniques I described earlier will scale and translate to these networks, although we don't have exact speed results to share at this point.

Okay, so to summarize. At the beginning I talked about sparsity in the neocortex, both activation sparsity and connectivity sparsity. I discussed the potential benefits, the massive performance gains that are possible if we can exploit sparsity. We noted that sparsity in general is not well exploited in today's hardware platforms, and that it's actually quite challenging to create sparse networks that are both accurate and able to be accelerated on hardware. What we were able to show today are some promising techniques; we see a path to really leveraging the promise of sparse-sparse networks. I showed an FPGA implementation today that demonstrates the feasibility of this, with more than two orders of magnitude improvement in performance and in energy efficiency compared with dense networks. To actually do this properly really requires the development of smart training algorithms, and really understanding the hyperparameter space and all of the different issues that can arise when training really accurate sparse networks.

So that's it for my talk. Thank you for your attention. If you have any questions, please feel free to email me; I can also take questions in the live session right after this talk. Feel free to follow us on Twitter at the two handles noted there, and if you go to the papers page on numenta.com, all of the research papers we've published, including the ones on sparsity as well as the other areas of brain research we focus on, are available.