TensorFlow World 2019
October 31, 2019, Santa Clara, USA
TensorFlow model optimization: Quantization and pruning (TF World '19)

About speaker

Raziel Alvarez
Software Engineer at Google

Lead of TensorFlow Model Optimization (https://www.tensorflow.org/model_optimization), co-founder and TL of TensorFlow Lite, and R&D of speech technologies at Google.


About the talk

Come learn from our TensorFlow performance experts, who will cover topics including optimization, quantization, benchmarking, and more. We will discuss both current best practices and future directions in core technology.

Presented by: Raziel Alvarez


Hi, my name is Raziel Alvarez and I work on TensorFlow. Today I will talk about the TensorFlow Model Optimization Toolkit, and in particular about techniques around quantization and neural connection pruning. First I'll introduce why we need optimization, what the toolkit is, and why we think it's important — why we're investing in this area. Then I'll cover the tools that we have available, and at the end I'll give a quick overview of our roadmap in the short and longer term. We'll close the presentation with some minutes for Q&A.

The toolkit implements techniques that allow you to optimize machine learning models for deployment and execution. This is important because machine learning is a big field and we think there is a lot of room to make it more efficient. Efficiency can make all these applications better — higher quality, or cheaper to execute — and it can enable new models, new deployments, new product ideas that otherwise are not possible, even if you just run your models on

servers. So currently machine learning runs either on the server or on the edge. On the server, you might think, there is a lot of capacity — a lot of compute, a lot of memory — so what is the benefit of optimizing these models? Well, latency is still a very important metric for a lot of applications. Or you want to improve the throughput: how many tasks can run on your server. And these two are directly correlated with money, so everybody will

want to save money — potentially a lot of money. Now on the edge it's a little bit more obvious why: it's a very resource-constrained environment. Even if you are talking about smartphones, we need to deal with reduced memory and compute; power consumption is typically an issue; and downloading models from the cloud, or even just transferring parameters from memory to the processor within the chip, can be a problem — from the smallest devices to the largest.

We also have a wide variety of hardware — much more than on the server — and we need to make sure these models run efficiently on all these different types of hardware. So it follows that if we optimize these models, we can have better models, and eventually that translates into enabling new products that otherwise wouldn't exist if we were just running these models on a server. And this opportunity is larger than just smartphones:

machine learning is trickling down into many more environments. Machine learning models are used to detect failures in machinery in factories, in self-driving cars, or in the office to scan documents and try to understand them. And just to give you some numbers: the size of the smartphone market is really a fraction of the potential for edge devices in general. So basically there are two reasons: we want to make machine learning more efficient, which is already very

important for servers, and which is especially crucial for embedded devices. So we started this toolkit about a year ago. We initially launched post-training quantization with a hybrid type of quantization — I'll go into more detail later in the presentation. Then earlier this year we launched an API for neural connection pruning. Then we created a specification of quantized — integer — operations for TensorFlow Lite, and we launched post-training quantization support targeting that

specification. More recently we added support for reduced float precision, and hopefully soon we're going to be launching a quantization-during-training API and also adding support in TensorFlow Lite for sparse computation. So now let's go into these techniques and tools in a little bit more detail, and build some basic understanding of what quantization is, why it's hard, and why we're approaching it the way we are in the toolkit. Let's start with a simple example: a matrix multiply. It's a basic operation in machine learning models. You have two matrices, tensors A

and B; you multiply them and you get a result, C. As a reminder of how a matrix multiply works, each element of the result is computed via multiplications and accumulations. And since we typically train these models at a higher precision — let's say float32 — it follows that the inputs are float32, the products are float32, and the accumulations are float32 as well. This is fairly straightforward. There is some loss of precision, but machine learning is pretty

good at dealing with that at this level of precision — so, no problem. Now what does this have to do with quantization? Let's go back to the goals of optimization: we want to address all those resource restrictions, and we want to be able to deploy on much more hardware. So let's reduce the precision: say we go from 32-bit floats to 8-bit integers, and we operate entirely in the integer domain. This will be good because

we are going from 32 bits to 8 bits, so we reduce memory — the models are four times smaller — and the operations are typically faster to execute and consume less power. And because the parameters, and also the dynamic values — the activations — are smaller, we reduce the bandwidth pressure inside the chip. Then there is more room for things to flow, which also translates into faster compute and reduced power. And 8-bit integer operations are a fairly common denominator across hardware:

if a chip supports anything, it very likely supports 8-bit integer operations. OK, so we're going to reduce the precision — how do we convert a 32-bit float to an 8-bit integer? We do something very simple: a linear mapping, where we take the values in a tensor, compute the minimum and maximum value, and then, based on those, spread the values evenly over the 8-bit range. This is very simple. So is that all we need to do?
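The min/max linear mapping described here can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the converter's actual implementation, and the function names are made up:

```python
import numpy as np

def quantize_linear(x, num_bits=8):
    """Map float values to unsigned integers by spreading [min, max]
    evenly over the integer range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid div-by-zero
    zero_point = round(qmin - x_min / scale)        # integer that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_linear(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
q, scale, zp = quantize_linear(x)
x_hat = dequantize_linear(q, scale, zp)  # close to x, within one scale step
```

Round-tripping a value through this mapping loses at most about half a quantization step, which is exactly the precision loss the rest of the talk is about managing.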

It's not that simple. Let's go back to the example. We have the matrices with the values now quantized, so the inputs to the multiply are 8-bit integers. But the products now need 16 bits to be represented, and you probably want 32 bits to accumulate them. So now you have a tensor C full of 32-bit integers, and when you want to feed that into another matrix multiply that expects 8-bit inputs,

what do you do? You scale the values back down to 8-bit integers, so you can feed them to your 8-bit matrix multiply. Why do we care about all of this? Because we are changing the static values — the parameters, the weights — by scaling them, and we are also scaling the dynamic values, the activations, on the fly. In this very simple example it's just a scaling operation.
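To make that requantization step concrete, here is a NumPy sketch of the whole chain — 8-bit inputs, 32-bit accumulation, rescale back to 8 bits. It uses symmetric quantization (zero point of 0) to keep the arithmetic simple, and is an illustration rather than how any particular kernel is implemented:

```python
import numpy as np

def quantize_sym(x, num_bits=8):
    # Symmetric linear quantization: zero maps to zero, so the matmul
    # below needs no zero-point correction terms.
    scale = float(np.abs(x).max()) / (2 ** (num_bits - 1) - 1) or 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# An int8 * int8 product needs 16 bits; summing several of them needs
# more, so accumulate in int32.
acc = qa.astype(np.int32) @ qb.astype(np.int32)

# The accumulator holds values at scale sa * sb. Requantize so the
# result can feed the next 8-bit matrix multiply.
c_float = acc.astype(np.float32) * (sa * sb)   # approximately a @ b
qc, sc = quantize_sym(c_float)                 # 8-bit again
```

Note how the scales `sa`, `sb`, and `sc` chain together; the ordering mistakes mentioned next are exactly about mishandling products and ratios of these scales.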

But it can get more complicated: real models are more complex. For example, if you are not careful about how you order the operations and apply the scales, the scales can cancel each other out, and things just don't work if you go about it naively. Then there are more complicated decisions about how you partition the computation into its quantized form. We want to be efficient — lower precision is good — but we also want to be accurate,

so there is a trade-off that you have to manage. Further complicating things, we have heterogeneous hardware: different types of hardware with different capabilities — which operations each one supports, different preferences for certain quantized operations, differences in the data types they support. We want to account for all those things when we are creating our quantized recipe, our quantized program. Then there is the fact that machine learning is hard to interpret.

We don't fully understand it — we understand how it works, but not to the level where we have good proofs that a transformation we apply to this model, this program, will actually work and not result in a catastrophic error. You don't want your model to suddenly start giving you wrong results. That makes these transformations much more complicated. And finally, quantization as we've defined it might also need some extra

data. Going back to the example of the matrix multiply: we need to compute the minimum and maximum values of the dynamic activations, and those can only be known if we run inferences through the model, which means you need to provide some representative data. That's basically just another hurdle you have to account for when quantizing this program. So when we talk about quantization, we're really talking about rewriting — transforming — this machine learning program into an approximate

representation based on the operations that you have available. So how are we addressing this in the toolkit? The first thing we decided to do was to scope down the problem: we define specifications for common operations — things like matrix multiply and convolution — that define the quantized behavior. So now we have this low-level information that is relevant to quantization, and

hardware can target those specifications, our tools can target the specifications, and we can all work at this level. We also get another benefit, from the user's point of view: you can quantize your model once, and then that model can run on different hardware without any change. So right now the toolkit supports three different quantization types. The first one is reduced float, where we basically go from float32 to float16 parameters and computation. That's pretty straightforward. The next one is what we call hybrid quantization, which basically quantizes

the parameters, while the biases and activations are left in 32-bit float, and then we try to be as smart as possible about how we execute this program. The goal is that expensive operations, like the big matrix multiplies, are done in the integer domain, while we use floating point for things like activation functions. So it's a nice trade-off between accuracy and performance. Then the third one is integer quantization. This means everything is integer: all the parameters are integers,

and all the operations are in integers. This is obviously the more complicated one. So, the benefits. With reduced float, your model is now half the size, it's supported on many targets, and the accuracy loss is going to be very minimal — it pretty much always works. I haven't seen myself an actual model that doesn't work when going from float32 to float16. With hybrid quantization you get a 4x reduction in size, and then, depending on the operations you are using, different performance improvements — it tends

to be larger for fully connected models or RNNs. And the third one, integer quantization, has the same benefits as hybrid in terms of memory size, but it is faster to execute, and it has the great advantage of much more hardware coverage: some hardware is integer-only, like the Edge TPU. Now let's talk about the tools to actually quantize a model, per quantization type. We have two types of tools: one that works post-training — so it works on a regularly trained model — and another, a work in progress, that works during training. Let's start with post-training. The

process is very simple. You train your model however you train it; then you use the TensorFlow Lite converter: it converts the model to TensorFlow Lite and quantizes it on the fly, and then you have a model that you can execute on whatever hardware supports that quantization type. Now let's look at the specific cases. For reduced float, you set the default optimizations flag, and the type you are targeting is float16; basically the converter will take care of converting all the parameters

and the computation. Again, depending on the hardware you run this model on, you might get a speedup — for example, GPUs support float16 natively, so you might get some speedup there, either because of the computation or just because the bandwidth requirements are reduced. The benefits: size goes to half, and the accuracy drop is very minimal — it will stay within the noise.
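As a sketch, the reduced-float flow looks like this with the TensorFlow Lite converter; the tiny Keras model here is just a stand-in for your trained model:

```python
import tensorflow as tf

# A tiny stand-in; any trained Keras model goes through the same flow.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Post-training reduced-float quantization: enable the default
# optimizations and declare float16 as an allowed reduced type.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()  # flatbuffer with float16 weights
```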

The next one is hybrid quantization. Again, it's very easy: you just set the flag — this is actually the default behavior — and it will quantize all the parameters, while operations that don't have a defined quantized specification are kept in the original precision. You will get some speedups, and you will be able to target whatever hardware complies with the specification; typically this works pretty well for CPUs. The benefits: 4x compression for the models, plus some speedup — for a convolution-based model the speedup is more modest.
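The hybrid flow is the same converter call with just the optimization flag set — again a sketch, with a placeholder model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# With only the optimization flag set, the converter applies hybrid
# quantization: weights become 8-bit integers, activations stay float,
# and supported ops do their heavy math in the integer domain.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_hybrid = converter.convert()
```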

And that was a year-old number, so it's probably faster now. The accuracy is pretty good too, and actually we're working on some changes for convolution models that will make it even more accurate soon. Then the third one is integer quantization. This one is more complex, because now you need to provide some data. So you say, OK, I want to use integer quantization — now you need to provide some sample data,

meaning unlabeled samples of what your neural network will typically see in reality. If it's a vision model, you feed it some pre-processed images. It works pretty well. These are some results from post-training integer quantization across different models: for the majority, the accuracy loss is not that big with respect to the full-precision trained baseline; only one of them shows a more meaningful drop. So in general, models work well with post-training quantization.
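The integer flow adds the representative dataset: a generator yielding unlabeled, pre-processed sample inputs. Random data stands in for real samples in this sketch:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Unlabeled samples let the converter observe activation ranges; in
# practice you would yield real pre-processed inputs, not random ones.
def representative_data():
    for _ in range(100):
        yield [np.random.randn(1, 10).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
tflite_int = converter.convert()
```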

Now I'll talk about during-training quantization, because as the previous results show, there are models that will benefit from doing quantization-aware training. Quantization-aware training means we try to emulate the quantization — the precision losses — during the training of the neural network, with the hope that the parameters will be tuned to account for that. As for the process of doing quantization-aware training with our API — again, we are trying to make it very simple.

So we built the API on Keras, aiming to make it very easy to use. Basically, we assume that you already have a Keras model, and then you just need to call our API to apply the quantization emulation. It might change a little bit, but it will look something like this: you have a model that you already built, and the only thing you need to do is call our function on your model, and you get a new model that has all the emulation of quantization in it. Then you just call your fit

function, and that's it. You train your model as usual, and then you can go through the TensorFlow Lite converter: it will take this model trained with quantization emulation — which carries all the data on how to quantize it — and it will produce a quantized model that, just like the post-training one, you will be able to execute on different hardware.

Here are some numbers from quantization-aware training — preliminary numbers. The delta is a little bit better than post-training quantization; it's not a very big difference, except for the mobile SSD model: before, it was a 4% drop with post-training quantization, and in this case it's 2.9% with quantization-aware training. So it is still a useful tool, and we're building it now. Now you may wonder: there are a lot of quantization types and tools, so which one should I use? My recommendation: if you are just starting, start with reduced float. It's the first one to try — it's very easy to use, it doesn't require any data, accuracy will probably be the same, and then

for latency, depending on the hardware, you might get some reduced latency; and compatibility is basically everywhere — anywhere you can execute floating-point operations, you will be able to use it. The next thing to try would be hybrid quantization. Again there are no data requirements; accuracy will still be good — probably not as good as float16 in some cases, but still good; it will be faster than reduced float; and compatibility is basically everywhere you have support for float and integer operations.

The third one to try is integer quantization with the post-training tool. This one requires you to provide a little bit of data, but the latency will be the fastest, and it will also give you more hardware coverage. And the last thing to try would be integer quantization with quantization-aware training. This one is a little bit more involved: you're doing training, so you're supposed to have a training setup — training scripts — but the accuracy will be

better than doing just the post-training version, and you keep the benefits of being the fastest one and the one with the most hardware coverage. So that was quantization. Again, with all these tools we're trying to make things very easy to use, and they will be ready for you to try out and give us feedback. Now, connection pruning. What is neural connection pruning? Well, the way we have implemented it so far, it is a training technique: during the training process, we are removing — dropping — connections from the

neural network. These dropped connections basically just become zeros in the tensors of your model, and at the end you end up with sparse tensors. Those sparse tensors can then be compressed, and potentially used for faster inference. Here's an example: a randomly initialized tensor, where the dark values are nonzero and the white are zero. As training progresses it starts becoming sparse, and you can see at the end we have removed most of the parameters.

The API for this is very similar to the quantization one — again, we're trying to bring some consistency to the APIs. It's built on Keras, so it's the same: you have a model that you've built in Keras, and then you call our API to apply the pruning logic. We are trying to make this as simple as possible. The only thing you need to define is the pruning schedule: basically, when you want to start dropping these connections, until when, and how fast — how aggressive you want the sparsification to be. Then you just call our

prune function — which, again, will modify your graph to add the pruning operations internally — and then you just call your fit function and train as usual. Once you have trained, you have two options, or soon you will have two options. You can take the same model, export it to a TensorFlow model, and just compress it, and the model will be smaller. And soon you will be able to convert it via TensorFlow Lite as well.

That will give you a reduction in size and potentially a speedup, depending on what pruning configuration you're using and the hardware you are targeting. So, what are the benefits of pruning? We have tried it on a lot of models and it works pretty well, and it doesn't need a lot of things like hyperparameter tuning; as long as you're careful when setting up your training, it has worked pretty well without a lot of babysitting.

It also has potential for speedups, depending on hardware support. And we have had really good results: we can make a lot of the parameters basically go away — we see 50 to 90% sparsity with negligible accuracy loss. Another great thing is that it works well together with quantization. A typical setup would be to train with pruning and then apply post-training quantization, and typically the accuracy is still pretty good, so you get the compound benefits of the two techniques. These are

some of the results we shared when we launched this. With Inception V3 you can get all the way to almost 90% sparsity with relatively small accuracy losses, and with GNMT — neural machine translation — we can also take it to almost 90% with pretty small accuracy losses. And we have used this in production: at the Google Pixel event, the speech recognition models used on-device were pruned, and we were able to get a model with server-class quality

running on a phone. OK, so now let me quickly cover the roadmap. Like I mentioned, for quantization we're working on the quantization-aware training API, so that should be ready soon, and we are also working on our specs for RNNs. Then for pruning, you should expect some improvements: making quantization more accurate, especially for convolution layers, and support for sparse computation in the TensorFlow Lite runtime.

Longer term — I don't know if you have heard about the MLIR compiler infrastructure. It's interesting to us because it's a better way for us to write these transformations. Like I said at the beginning, quantization is really taking one program, one internal representation, and transforming it into another one. And one of the things it will enable is better targeting of hardware. Right now we have this specification: users target the specification, and the model can

execute on different hardware. But some users just want to target one piece of hardware and get the best out of it, and we're hoping that with the new infrastructure we are building on top of MLIR, that will be possible. And finally, I really just want to invite you to go ahead and try it out and give us feedback on what techniques you would like to see. In research there are techniques popping up all over the place, and we have to go through them, deciding what's useful and what's not, what is general and what is very specific. So we would love to hear your

feedback about the tools that we already have. We're trying to make them as easy as possible to use; we know we still have a long way to go, and any feedback you can provide will be really appreciated. And I think there is a little bit of time for questions, if any of you have questions. [Audience] Hi, thank you for the presentation. I have a question regarding the training with integer quantization in the pipeline: is that going to be true quantization during training, or not? By true I mean that you expect all the operations to happen in the

integer domain, because I want to make training faster as well. [Raziel] Right now the way we are targeting it is emulation: basically, we quantize the parameters, then we dequantize them again, and then we do the float operation. That's what we are doing right now. But yeah, longer term we want to go in that direction. [Audience] Thank you. [Audience] I have one question: after you do the quantization, is there a way that you can also visualize the

final quantized model? That was one question. And the other question was: what sort of tools are you going to provide for model correctness — to at least evaluate, you know, whether it's functionally correct? [Raziel] For visualization, we have a visualizer, so you can see the quantized model. I don't know if it will give you all the information that you are looking for.

We also want to make our tooling a bit better, because perhaps, for whatever reason, you want to get in there and start looking at the activations and how they change, and visualize how the graph changed. But in my experience, the only thing that really works is to evaluate on the real data that you care about running your model on. We have tried things like measuring the error of the quantized model versus the full-precision one,

but depending on the model and the layers, it's easy to miss real regressions that way, because in the end you really care about the actual task results. [Audience] Thank you. I have a question about the results from the GNMT training with induced sparsity. I was wondering if you had any insights on why training with, say, 80% sparsity would perform better than the original version. [Raziel] I have seen the same with some quantized models. I've never had the time to really sit down and try to understand what the reasons are. Sometimes it

all depends on your evaluation set, really: if it's not that big or meaningful, then these jumps are all possible. I've seen models where the metrics looked great after quantization, and then you throw in a newer test set — from speech, for example, with noisier utterances — and then you clearly see the difference between one and the other. [Audience] You mentioned model understanding; for explainability techniques, like saliency maps, do you have any insights on how these optimization techniques

affect the ability to interpret the model? [Raziel] People have asked about that — using these tools to understand neural networks by building approximations of them — but so far I haven't had any luck with it, and I don't have anything meaningful to say, because I haven't run many experiments there. [Audience] What's the best way to handle fragmentation of hardware? Quantization can depend on the target hardware, and on Android there are many types of hardware. So what are the

best practices there? [Raziel] We partner with hardware vendors, because we would like to be able to target their hardware in the most precise and efficient way — that's one way we try to address it. With the knowledge of what hardware is out there and what it supports, we try to create this specification that accommodates everybody, which is as good as we can do right now. Then longer term — and I don't want to say too much, because I don't have a very concrete plan to share — with the MLIR infrastructure we are building,

we want to be able to target hardware better: partner better with vendors to understand their hardware's capabilities, and create transformations that target that hardware. Right now there are very arbitrary boundaries where we say "this is the specification," but we should be able to take advantage of mixing and matching operations to get something better. [Audience] I

have a question about pruning. As a general rule, layers and operations are converted to matrix multiplies because of their efficiency. With pruning, you're now dropping individual weights one by one, so there must be some crossover point — you'd need to prune by 10, 15, 20 percent before you actually get an improvement. Any thoughts on where that is? [Raziel] So, for example, we know that for CPUs, the SIMD instructions have registers that can accommodate 16 values. We know that if you want a speedup on the CPU,

we can provide a setting that says "I want to prune in blocks of 1 by 16," and then we can hope to get a speedup on the CPU, for example. Unfortunately, right now this is hardware dependent, but that's one thing you can do today.
