TensorFlow World 2019
October 31, 2019, Santa Clara, USA
TensorFlow Lite: Solution for running ML on-device (TF World '19)

About speakers

Pete Warden
Engineer at Google
Nupur Garg
Software Engineer at Google
Matthew DuPuy
Principal Software Engineer at Arm

I am an experienced software engineer and instructor with a demonstrated history of working in the civic & social organization industry. I have a strong academic background, having graduated with an M.S. in Computer Science from California Polytechnic State University, San Luis Obispo.


17th American to summit K2, 5th to summit Annapurna, 2nd to summit Everest, K2, and Annapurna. Job objectives: Principal Systems or Embedded Software Engineering positions involving system integration and engineering business development. Specialties: embedded systems design and integration; technical software team management; team lead in high-altitude and alpine mountain climbing and traditional rock climbing.


About the talk

TensorFlow Lite is TensorFlow’s lightweight cross-platform solution for mobile and embedded devices.

It enables on-device machine learning inference with low latency, high performance, and a small binary size. It is the standard solution at Google and the primary inference framework for all on-device use cases.

Presented by: Pete Warden, Nupur Garg

Transcript

…we were able to learn how to dance on a mobile phone: the smartphone camera turned into a powerful tool for analyzing body poses. The team developed an advanced model for pose segmentation and was able to take their implementation and convert it into TensorFlow Lite, and once it's in that format we can use it directly on device. Running the machine learning models that detect body parts is a computationally expensive process, and using the on-device GPU made it possible to do all of this computation on the device and still give a great user experience.

Anything that involves movement would be a great candidate, which means people who have skills can teach other people those skills. I think that's really exciting: when you have something that empowers people to teach people, you know you have something. When Tim originally did this, he danced in slow motion, then used the models running on device to speed his dance performance up to match the professional dancer. While learning, he slowed the motion down in order to understand which motions he was doing well and which he needed to improve. Applications like this can be used on device for educational purposes, for learning dance and other skills as well.

New cutting-edge models are also pushing the boundaries of what's available on device. BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of natural language processing tasks. Today we're launching MobileBERT: BERT completely re-architected to be not only smaller but faster, without losing accuracy. Running MobileBERT with TensorFlow Lite is 4.4 times faster on the CPU than BERT and 77% smaller, while maintaining the same accuracy.

Let's take a look at a demo application. This is a question-and-answer example application that takes snippets from Wikipedia, lets a user ask questions on a particular topic (or suggests a few pre-selected questions to ask), and then searches the text corpus for the answer to the question, all on device. We encourage you to take a look at both of these demo applications at our booth.

We've worked hard to bring features like these, from Dance Like to MobileBERT, to your applications by making it easy to run machine learning models on device. In order to deploy on device, you first need to get a TensorFlow Lite model. Once you have the model, you can load it into your application, transform the data in the way that the model requires, then run the model and use the resulting output.

In order to get a model, we've created a rich model repository with many new models that can be used in your applications in production right now. These include basic models such as MobileNet and Inception, as well as newer ones like MobileBERT, style transfer, and DeepLab V3.

Once you have your model, you can use the TensorFlow Lite support library, which we're also launching this week. It's a new library for processing and transforming data. Right now it's available for Android image models, and we're working on adding support for iOS as well as additional types of models. The support library simplifies the pre-processing and post-processing logic on Android; this includes functions such as rotating an image 90 degrees or cropping an image. We're also working on providing auto-generation APIs that target your specific model and generate APIs that are simple to use with it. The initial launch, as I mentioned, is focused on image use cases, but we're working on expanding to a broader range of models.

So let's look at how this looks in code. Before the support library, in order to add TensorFlow Lite to your application you needed to write a lot of code, most of it doing data pre-processing and post-processing. With the support library this comes down to a few lines: the first two lines load the model, then you load your image data into the model and it transforms the image as required. Next you run the model and get back a map of string labels to float probabilities.

This is how the code will look with the auto-generation APIs, which will be launching later this year. One of the biggest frustrations with using models has been not knowing a model's inputs and outputs; now model authors can package this metadata with their model so it's available from the start. This is an example of the JSON file that a model author can package into the model. This will be launched with the auto-generated APIs, and all of the models in the model garden will be updated to have this metadata.

In order to make it easy to use all of the models in our model garden and the features of the TF Lite support library, we've added example applications for Android and iOS for all of the models, and the applications use the support library wherever possible. We're also continuing to build out example applications on both the Raspberry Pi and the Edge TPU.

Now, what if your use case isn't covered by either our model garden or the support library? There are a ton of use cases beyond the specific models we've listed. The first thing you need to do is either find a model or generate the model yourself with TensorFlow, exported as a SavedModel, the unified file format for TensorFlow 2.0. You take the SavedModel, pass it through the TensorFlow Lite converter, and you get a TensorFlow Lite FlatBuffer model as output. In code it's actually very simple: you can save your model with one line and use two lines of code to load the saved model and convert it.
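
For reference, that flow looks roughly like this in the Python converter API; this is a sketch assuming a TensorFlow 2.0 model object, and the file names are placeholders:

```python
import tensorflow as tf

# `model` stands in for your trained tf.keras model or tf.Module.
# Save the trained model in the SavedModel format (one line).
tf.saved_model.save(model, "saved_model_dir")

# Load the SavedModel into the converter and convert it (two lines).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()

# Write the resulting TensorFlow Lite FlatBuffer to disk.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```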

We also have other conversion APIs; the details of those are available on our website. Over the last few months we've worked really hard on improving our converter. We've added a new converter with better debuggability, including source-file location identification: this means you can know exactly where in your code an operation that cannot be converted to TensorFlow Lite is defined. We've also added support for control flow v2, which is the default control flow in TensorFlow 2.0. In addition, we've added new operations, as well as support for new models including Mask R-CNN, Faster R-CNN, MobileBERT, and DeepSpeech V2. In order to enable these new features, all you have to do is set the experimental new-converter flag to true. We encourage everyone to participate in the testing process; we plan to make this new converter the default backend at some point in the future.
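
As a sketch, opting in through the Python API looks like this (the attribute mirrors the flag mentioned above; treat the exact spelling as an assumption for your TensorFlow version):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
# Opt in to the new converter backend while it is still experimental.
converter.experimental_new_converter = True
tflite_model = converter.convert()
```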

So let's look at the debuggability of this new converter. When running this model, the converter reports that the unsupported op is neither a custom op nor a flex op, and it provides a stack trace that lets you understand where in the code the operation is called; that way, you know exactly which line to address.

Once you have your TensorFlow Lite model, it can be integrated into your application the same way as before: you load the model, pre-process the data, run it, and use the resulting output. Let's take a look at a pared-down version of this code in Kotlin. In the first two lines you load the model and construct the interpreter. Then you initialize the input and the output: the input should be a ByteBuffer, and the output needs to hold all of the probabilities, so it's a float array. Then you run it through the interpreter and do any post-processing as needed. To summarize these concepts: you have a converter to generate your model and an interpreter to run your model, and the interpreter calls into op kernels and delegates, which I'll talk about in detail in a bit.
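
The Kotlin from the slide isn't reproduced here; as a rough equivalent, here is a minimal inference sketch using the Python interpreter API (the model path and input values are placeholders):

```python
import numpy as np
import tensorflow as tf

# Load the TensorFlow Lite model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Pre-process: shape the input exactly the way the model requires.
input_data = np.zeros(input_details[0]["shape"], dtype=np.float32)

# Run the model and read back the output probabilities.
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_details[0]["index"])
```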

And you can do all of this from a variety of language bindings: we've released a number of new first-class language bindings, including Swift and Objective-C for iOS, C# for Unity developers, and C for native developers on any platform. We've also seen the creation of a number of community-owned language bindings for Rust, Go, and Dart.

Now that we've discussed how TensorFlow Lite works at a high level, let's take a closer look under the hood. One of the first hurdles developers face when deploying models on device is performance. We've worked very hard, and are continuing to work hard, on making this easy out of the box. We've made improvements on the CPU, the GPU, and many custom hardware accelerators, as well as adding tooling to make it easy to improve your performance.

This is where TensorFlow Lite's performance stood at Google I/O in May. Since then we've had significant performance improvements across the board, from float models on the CPU to models on the GPU. To give you a sense of how fast this is: a float MobileNet V1 model takes 37 milliseconds to run on the CPU; quantized, that model takes only 13 milliseconds on the CPU; on the GPU, the float model takes 6 milliseconds; and on the Edge TPU, the quantized model runs in about 2 milliseconds.

Now let's discuss some common techniques to improve model performance. There are five main approaches: use quantization, use pruning, leverage hardware accelerators, use mobile-optimized architectures, and profile your model.

The first way to improve performance is quantization. Quantization is a technique used to reduce the precision of static parameters, such as weights, and dynamic values, such as activations. Most model training and inference uses float32; however, in many use cases you can use int8 or float16 instead of float32 to improve latency without a significant decrease in accuracy. Quantization also enables many hardware accelerators that only support 8-bit computation, and provides additional acceleration on the GPU, which can perform two float16 computations in place of one float32 computation. We provide a variety of techniques for performing quantization as part of the model optimization toolkit, and many of them can be applied after training for ease of use.
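
As a sketch, post-training quantization through the Python converter, assuming TensorFlow 2.0 (paths are placeholders):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Default post-training quantization: weights are stored as 8-bit integers.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Alternatively, target float16 weights for GPU-friendly models:
# converter.target_spec.supported_types = [tf.float16]

tflite_quantized_model = converter.convert()
```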

The second technique for improving model performance is pruning. During model pruning we set unnecessary weight values to zero, which removes what we believe are unnecessary connections between the layers of a neural network. This is done during the training process so that the neural network can adapt to the changes. The resulting weight tensors have many more zeros, which increases the sparsity of the model. With the addition of sparse tensor representations, the memory bandwidth of the kernels can be reduced, and faster kernels can be implemented for the CPU and custom hardware. For those who are interested, Raziel will be talking about pruning and quantization in depth after lunch in the Great America Ballroom.
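
A minimal magnitude-based pruning sketch using the TensorFlow Model Optimization toolkit, assuming a Keras model; the sparsity schedule values are placeholders:

```python
import tensorflow_model_optimization as tfmot

# Wrap the model so low-magnitude weights are progressively zeroed during training.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,  # a tf.keras model you have built or trained
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.5,
        begin_step=0,
        end_step=10000,
    ),
)

pruned_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Training proceeds as usual; this callback updates the pruning masks each step.
# pruned_model.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```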

Revisiting the architecture diagram more closely: the interpreter calls into op kernels and delegates. The kernels are highly optimized for the ARM NEON instruction set, and delegates allow you to access accelerators such as the GPU, the DSP, and the Edge TPU.

So let's see how that works. Delegates allow parts of the graph, or the entire graph, to execute on specialized hardware instead of the CPU. In some cases certain operations may not be supported by the accelerator, so the portions of the graph that can be offloaded for acceleration are delegated, and the remaining portions run on the CPU. However, it's important to note that if a graph is delegated into too many components, it can actually slow down graph execution in some cases.

The first delegate we'll discuss is the GPU delegate, which enables faster execution for float models; it's up to seven times faster than the floating-point CPU implementation. Currently the GPU delegate uses OpenCL when possible, and otherwise OpenGL, on Android, and uses Metal on iOS. One trade-off with delegates is an increase in binary size: the GPU delegate adds about 250 KB to the binary.

The next delegate is the Qualcomm Hexagon DSP delegate. In order to support a greater range of devices, especially mid- to low-tier devices, we worked with Qualcomm to develop a delegate for the Hexagon chipset. We recommend using the Hexagon delegate on devices running Android O and below, and the NNAPI delegate, which I'll talk about next, on devices running Android P and above. This delegate supports integer models, increases the binary size by about two megabytes, and will be launching soon.

Finally, we have the NNAPI delegate for the Android Neural Networks API. The NNAPI delegate supports over 30 ops on Android P and over 90 ops on Android Q. It accepts both float and integer models, and because it's built into Android devices it adds no binary size increase.

The code for all of the delegates is very similar: you create the delegate and add it to the TF Lite options for the interpreter. Here is an example with the GPU delegate, and here's an example with the NNAPI delegate.
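
On Android you would pass the delegate through the interpreter options; a rough Python-side equivalent uses `load_delegate`, assuming a delegate shared library is available (the library name here is illustrative, not a file you can count on):

```python
import tensorflow as tf

# Load a hardware delegate from a shared library (name is a placeholder).
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# Supported ops run on the accelerator; the rest fall back to the CPU kernels.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
```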

The next way to improve performance is to choose a suitable model architecture. For many image classification tasks, people generally use Inception; however, when running on device, MobileNet is 15 times faster and 9 times smaller. It's therefore important to investigate the trade-off between accuracy and model performance for your application.

Finally, you want to make sure you're benchmarking and validating all of your models. We offer simple tools to enable this, including per-op profiling, which helps determine which ops are taking the most computation time. This is how you execute the profiling tools from the command line, and this is what our tool outputs when doing per-op profiling for a model: it lets you narrow down your graph execution and go back and tune performance bottlenecks.
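
As an illustration, the benchmarking binary in the TensorFlow Lite source tree can be invoked roughly like this (treat the exact flags as an assumption; check the tool's documentation for your version):

```
bazel run -c opt tensorflow/lite/tools/benchmark:benchmark_model -- \
  --graph=model.tflite \
  --num_threads=4 \
  --enable_op_profiling=true
```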

Beyond performance, we have a pair of techniques relating to op coverage. The first lets you utilize TensorFlow ops that are not natively supported in TensorFlow Lite, and the second lets you reduce your binary size if you only want to include a subset of ops. One of the main issues users face when converting a model from TensorFlow to TensorFlow Lite is unsupported ops: TensorFlow Lite has native implementations for a subset of TensorFlow ops, optimized for mobile. In order to increase our op coverage, we have a feature called TensorFlow Lite Select, which adds support for many of the TensorFlow ops. The one trade-off is that it increases the binary size by about 6 megabytes, because it pulls in the full TensorFlow runtime. This is a code snippet showing how you can use TensorFlow Lite Select: you set the target spec's supported ops to include both the built-in ops and the select TensorFlow ops.
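
That snippet, approximately, in the Python converter API (both enum values are part of `tf.lite`):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Use the optimized built-in TF Lite kernels where available, and fall
# back to selected TensorFlow ops for anything not natively supported.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
```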

Built-in ops are used whenever possible, in order to utilize the optimized kernels, and the select TensorFlow ops are used in all other cases. On the other hand, some TensorFlow Lite developers deeply care about their binary footprint, so we've added a technique we call selective registration, which includes only the ops that are required by the model. Let's take a look at how this works in code. You create a custom op resolver that you use in place of TF Lite's built-in op resolver, and then in your build file you specify your model and the custom op resolver you created. TF Lite will scan over your model and create a registry of all the ops contained within it, so when you build the interpreter it includes only the ops your model requires, thereby reducing your overall binary size. This technique is similar to the one used to provide support for custom operations, which are user-provided implementations for ops that we don't support as built-in ops.

Next we have Pete talking about microcontrollers.

So, TensorFlow Lite has had a lot of success on mobile devices like Android and iOS, on over three billion devices. (Oh, I might actually have to switch the slides back... yeah, there we go.) What is really interesting, though, is that there are actually over 250 billion microcontrollers in the world already. They tend to hide in plain sight: these are the chips in your cars, your washing machines, and almost any piece of electronics these days. They're extremely small, they only have maybe tens of kilobytes of RAM, they can't run Linux, and they are incredibly resource constrained.

You might think: okay, I've only got tens of kilobytes of space, what am I going to be able to do with it? A classic example of using a microcontroller is actually, and you'll have to forgive me if anybody's phone goes off, "OK Google." That's driven by a microcontroller running an always-on DSP. The reason it runs on a DSP, even though you have this very powerful CPU sitting there, is that the DSP uses a tiny amount of battery. If you want your battery to last more than an hour or so, you can't have the main CPU listening all the time; you need something that can sit there and sip almost no power.

So the setup we tend to use is a comparatively low-accuracy model that's always running on this very low-energy DSP, listening for something that might sound a bit like "OK Google." If it thinks it's heard it, it wakes up the main CPU, which is much more power hungry, to run an even more elaborate model that double-checks. So you get this cascade of deep learning models, all trying to detect the thing you're interested in. This is a really, really common pattern: even though you might not be able to run an incredibly accurate model on a microcontroller or DSP, with this kind of architecture it's very possible to build really interesting and useful applications and keep your battery life alive.

We needed a framework that would actually fit into these tens of kilobytes of memory, but we didn't want to lose all of the advantages of being part of the TensorFlow Lite ecosystem and the whole TensorFlow ecosystem. So we have an interpreter that fits within just a few kilobytes of memory but still uses the same APIs, the same kernels, and the same file format as regular TensorFlow Lite for mobile. You get all of the models and all of the wonderful tooling that Nupur was talking about, and you get to deploy them on these really tiny devices.

We thought some people would like the privacy side effects of this approach. For this demo, you have to press the button, if I can tap it here, and then you speak into this microphone that I've just got plugged in, a standard microphone, and it will display a little animation. So let's try it out when I press it and speak: "Yes." "No." Dammit, live demos. What we're using here is all hardware that we have now, battery powered. There, that's the animation when it hears the word "yes."

It recognized that. This is an example of using TensorFlow Lite for Microcontrollers, which is able to recognize simple words like "yes" or "no," and it's really a tutorial on how you can create something very similar to the "OK Google" model that we have on phones: to recognize short words, or even do things like recognizing breaking glass or any other audio events you want to detect. There's a complete tutorial you can run in Colab, and then you can deploy it on these kinds of microcontrollers. And if you're lucky and stop by the TensorFlow Lite booth, we might even have a few of these microcontrollers left to give away from Adafruit. I know some of you in the audience already have that box; thanks to that generosity, we've been able to hand some of those out, so come by and check it out.

So, let's see if I can actually... yeah, okay. The other good thing about this is that you can use it on a whole variety of different microcontrollers. We have an official Arduino library, so if you're using the Arduino IDE you can grab it immediately, and you just use it like you would any other library, if you're familiar with that. It's also available through systems like Mbed, if you're used to that on your devices, and from a few places like SparkFun and Adafruit you can actually get boards.

Now the magic wand: you'll have to trust me, because you won't be able to see the LED well. If I do a "W" gesture, it lights up the red LED; if I do a ring, it lights up the blue LED (some of you up front may be able to vouch for me); and then if I do an "L"... let's see if I get this... it lights up the yellow LED. As you can tell, I'm not an expert with it, but okay, let's see... we might need the camera to focus... fingers crossed... yes! So yeah, you can see they're showing a very nice video of it there, and they have some great examples out too.

You can just pick up one of these boards and get it running in a few minutes. It's pretty cool, and I thought I'd mention it: with the magic wand we're doing gesture recognition from an accelerometer. You can imagine all sorts of applications for this, and the key thing is that it's running on something that can run on a coin battery for days or weeks or months, once we get the power story right. This is really the key to the kind of ubiquitous ambient computing you might be hearing a lot about.

What other things can you do with these kinds of devices? Great things like the simple speech recognition we've shown. We have a demo at the booth doing person detection using a model that's only 250 KB; it just detects whether or not there's a person in view, which is obviously super useful for all sorts of occasions. We also have predictive maintenance, which is a really powerful application: if you think about machines in factories, or even about your own car, you can tell when it's making a funny noise and might need to go to the mechanic. Now imagine putting machine learning models on all of the billions of machines running in factories and industry around the world, and you can see how powerful that can actually be.

So, as we mentioned, we've got these examples out there now as part of TensorFlow Lite, and you can run them on Arduino boards, on Adafruit boards, all kinds of boards: recognizing "yes," with the ability to retrain using TensorFlow for your own words that you care about. Person detection is really interesting because, while we trained it for people, it will actually also work for a whole bunch of other objects from the COCO dataset; if you want to detect, say, cats instead of people, it's very easy to retarget it. And with gesture recognition, we've been able to train it to recognize these kinds of gestures; obviously, if you have your own things you want to recognize with accelerometers, that's totally possible to do as well.

One of the things that's really helped us do this has been a partnership with the people making these little devices. We actually helped design the devices, and we've been trying to get them out to people, even giving boards away so people can try them. Thank you for contributing a lot of code; this has been a fantastic partnership, and stay tuned for lots more where that came from.

So that's it for the microcontrollers. Just to finish up, I want to cover a little bit about where TensorFlow Lite is going in the future. What we hear more than anything is that people want to bring more models to mobile and embedded devices, and they want those models to run faster, so we're continuing to push on performance improvements. You'll also see more integration with TensorFlow and easier usage of tooling. On-device training and personalization is a really, really interesting area where things are progressing. And we really care about figuring out where your performance is going, and about automating the process of profiling and optimization to help you do a better job with your models.

To help with all of this, we also have a brand new course that's just launched on Udacity aimed at TensorFlow Lite, so please check that out. So that's it from us. Thank you for your patience through all of the technical hiccups; I'm happy to answer any of your questions. I think we're going to be heading over to the booth after this, so we'll be there, and you can email tflite@tensorflow.org if you have anything you want to ask about. Thank you so much; I look forward to chatting.
