Markus Levy is president of EEMBC, which he founded in April 1997. As president, he manages the business, marketing, press relations, member logistics, and supervision of technical development. Levy is also president of the Multicore Association, which he co-founded in 2005. In addition, Levy chairs the IoT Developers Conference. He was previously founder and chairman of the Multicore Developers Conference, a senior analyst at In-Stat/MDR, and an editor at EDN magazine, focusing in both roles on processors for the embedded industry. Levy began his career in the semiconductor industry at Intel Corporation, where he served as both a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products. He is the co-author of Designing with Flash Memory, the one and only technical book on this subject, and received several patents while at Intel for his ideas related to flash memory architecture and usage as a disk drive alternative.
About the talk
Abstract: For machine learning applications, neural networks are typically processed using some form of runtime inference engine. Alternatively, this presentation will show you how to use a neural network compiler to manipulate models expressed in different formats, such as TensorFlow Lite or ONNX, to generate a self-contained inference library. For a given Arm®-based target machine, this library delivers a smaller memory footprint and significant software acceleration by utilizing the Arm® CMSIS-NN kernel library.
Presenters: Ciprian Mindru, Software Engineer, NXP Semiconductors; Markus Levy, Director of AI and Machine Learning Technologies, NXP Semiconductors
Target Audience: Software Developer
Topics: #ArtificialIntelligence, Consumer Electronics, Edge Compute, Industrial, #IoT, Microcontrollers, Software and Tools, Voice and Image Recognition, Open Source, #ArmDevSummit
Type: Sponsored Session
Conference Track: AI in the Real World: From Development to Deployment
Air Date: 2020-10-08
Hello everybody, and welcome to our presentation on utilizing Arm CMSIS-NN to optimize a neural network compiler. We'll get into some of the technical details and some of the highlights of this technology so that you can really get an understanding of how this thing works. My name is Markus Levy. I'm the director of machine learning technologies at NXP. Presenting with me is Ciprian Mindru, an engineer at NXP, who was part of the staff that worked on optimizing the compiler.
So before we begin on the details, let me just give you some of the motivation behind this. You can see on this chart a high-level benchmark comparison using a CIFAR-10 benchmark, which is a relatively simple benchmark, but it does allow us to highlight the fact that, using the optimizations that Ciprian and his team did, we were able to get almost a 2x performance increase over the out-of-the-box version. What I mean by out of the box here is that, since Glow, the neural network compiler, is an open-source technology, you can easily download it from the GitHub repository and try it out in a very unoptimized manner, but it at least allows you to see the functional operation of this. So that's what the 6.05 frames per second is, but by the time you add in the CMSIS-NN backend optimized for the Cortex-M CPU core, you get almost a 2x performance increase.
Before we begin on the Glow technical details and talking about this neural network compiler, let me just give you a high-level overview of the device that we used as one of the targets. This particular diagram here is the RT1060, and the key point to note for this particular presentation is the CPU platform, which is a Cortex-M7 that runs at up to 600 megahertz. We have new devices that are on the horizon, specifically the RT1170, which has a 1 gigahertz core. So you can imagine that you get a very significant performance increase, and this is just an MCU, right? I think 1 gigahertz of performance is pretty amazing for an MCU platform.
Also, a little bit of an overview of NXP's broad machine learning solutions. This slide will highlight some of the key functions that we enable. Looking at the first build-out here, this is the do-it-yourself approach. eIQ, which is our machine learning support, is all about allowing you to bring your pre-trained machine learning models onto our platform and run them using some form of open-source technology, such as TensorFlow Lite, Arm NN, Glow, ONNX Runtime, etc. We also engage very heavily with third-party software and hardware vendors.
This is just a small list of the companies that we are involved with that provide complementary technologies or enablement. We also have a group that is providing turnkey solutions, and some of the examples you see here are for Alexa Voice Service or local voice control solutions that provide a combined hardware and software solution that you can drop into your platform. And lastly, there's another side of our business here that is very heavily focused on automotive, and they provide their own eIQ Auto toolkit, which provides optimizations, pruning techniques, quantization, et cetera, and it provides an automotive-quality inference engine that many of our automotive customers require.
Let's get into a little bit on the neural network compiler itself. First of all, this is extremely basic, but this is what a very simple neural network looks like. As you are probably familiar with already, it's comprised of these nodes that are connected by various links. Basically, what happens is you put the input in, which could be an image, a video frame, or sound. The network processes it, and the output gives you some level of confidence as to what that actual input was. So in a sense it's making a decision, or some type of a guess, on what it is. But the point here is that these nodes are all connected together to provide what some people might consider to be a brain-like function.
Now, looking at it a different way, these nodes are all represented by layers, and those layers represent various operator functions, such as convolution, ReLU, pooling, and fully connected layers. And then finally, on the output end, you typically would have a softmax layer, which provides the decision factor. The point of this slide is really to demonstrate that a machine learning model is comprised of multiple layers. Typically you'd find somewhere between 50 and 150 or even more layers, and each layer is comprised of many millions of computations. So it's pretty heavy-duty processing to run a machine learning model on a device, especially an MCU.
So now, getting into some of the Glow fundamentals. First of all, let me point out that in typical processing fashion, a model is executed using a runtime. In this case, the slide is demonstrating a TensorFlow Lite runtime. Now, the point here is that TensorFlow Lite and all runtimes, including Arm NN and ONNX Runtime, are dynamic processing engines. The runtime looks at each layer one at a time as it's processing it, so it's a very dynamic process. All of the work is happening during the runtime part of it. So the CPU core is not only processing the runtime engine, it's also processing all the operator computations. You can see that, using a runtime engine like TensorFlow Lite, the processor is very busy.
Here is how Glow works, and Ciprian will get into more details on this. Glow allows you to do ahead-of-time compilation, which is similar to any other type of compiler that you may be familiar with. So Glow takes the model during the development phase and actually goes through all the layers, and it performs both global and local optimizations to produce object code, which is then run as a sequence of computations on the platform. So there's no runtime component associated with it.
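
The contrast between a runtime engine and ahead-of-time compilation can be sketched with a toy example. This is only an illustration under simplifying assumptions: the layer functions, the `MODEL` list, and the `compile_ahead_of_time` helper are hypothetical names invented here, not TensorFlow Lite or Glow APIs, and a real compiler emits object code rather than a Python closure.

```python
# Toy illustration only: all names below are hypothetical,
# not TensorFlow Lite or Glow APIs.

def relu(xs):
    return [max(0.0, x) for x in xs]

def scale(xs, factor):
    return [x * factor for x in xs]

# A "model" is a list of (operator name, parameters) pairs.
MODEL = [("scale", {"factor": 2.0}), ("relu", {}), ("scale", {"factor": 0.5})]

KERNELS = {"relu": relu, "scale": scale}

def run_interpreted(model, xs):
    """Runtime style: look up and dispatch every layer on every inference."""
    for op_name, params in model:
        kernel = KERNELS[op_name]      # dispatch work repeated on each call
        xs = kernel(xs, **params)
    return xs

def compile_ahead_of_time(model):
    """AOT style: resolve every layer once and return a fixed call sequence."""
    steps = [(KERNELS[op_name], params) for op_name, params in model]

    def bundle(xs):
        for kernel, params in steps:   # no operator lookup left at run time
            xs = kernel(xs, **params)
        return xs

    return bundle

bundle = compile_ahead_of_time(MODEL)
result_interp = run_interpreted(MODEL, [-1.0, 3.0])
result_bundle = bundle([-1.0, 3.0])
```

Both paths compute the same result; the difference is where the work of walking the model happens. In the compiled case that work is done once, on the development machine, which is why no interpreter has to ship with the application.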
So, I will talk about a couple of methods for deployment. There are multiple ways to deploy a model using Glow. One way is to use PyTorch, which is an open-source machine learning framework developed by Facebook and widely used in research. It can be used to train the model and compile it with Glow directly from PyTorch. Another way is to use TensorFlow, which is an open-source machine learning framework developed by Google and widely used in production. It can be used to produce a protocol buffer format, which can be converted to ONNX, the Open Neural Network Exchange format, and then directly imported into Glow.
By using any of these methods, the model is then passed on to Glow, which performs target-specific optimizations for NXP devices. For targets using Arm Cortex-M devices, the Glow backend provides integration with the Arm CMSIS-NN library. For targets using the Cadence Tensilica HiFi4 DSP devices, the Glow backend provides integration with the HiFi NN library. The final product of the compilation chain is the so-called bundle, which is a self-contained library that can be directly compiled by the toolchain along with the other application code.
Next, a couple of words about the ways in which Glow can be used to run a model on the target. The first is just-in-time compiling, which is the approach used by most software frameworks. In this case, the Glow software runs natively on the target device. The network model is transferred from the host machine to the target, where Glow compiles the model into a machine executable. This technique of compiling the program while it is also being executed is named just-in-time compilation or dynamic translation. This approach is flexible in the sense that all that is required from the host machine is to provide the pre-trained model, with no extra intervention, and also dynamic in the sense that the model which runs on the target can be replaced easily on the fly, without disrupting the target from running, for example with a new model with potentially better performance and accuracy. The model could as well be provided from a cloud database.
On the right-hand side of the slide we have the other way. Here Glow runs natively on the host device, where Glow compiles the model into a bundle, which is later integrated with an application and transferred as a machine executable that is ready to run on the target. This technique of compiling the program before executing it is named ahead-of-time compilation or static translation. This approach has the benefit of an improved memory footprint on the target, since the code does not include the Glow runtime, but only the library, which contains the minimal amount of code. This matters for memory-constrained devices, for example embedded devices like microcontrollers. One drawback is that this method is static, in the sense that in order to replace the model running on the target, intervention of the host machine is required to compile the model and the application code. Also, the target must be disrupted to load the new program.
I will show you a couple of things about the compilation pipeline which Glow uses. The translation between high-level mathematical operators and low-level machine instructions is done with a lowering technique, which is used in conjunction with the LLVM compiler infrastructure. First, Glow analyzes the model graph of the neural network and maps each layer to a high-level mathematical operator, which in Glow terminology is called a node. Further, each node is lowered, that is, broken into smaller fine-grained operations called instructions. This lower-level representation is memory-aware, in the sense that the instructions operate on data values which are referenced by address. This enables low-level memory optimizations that are not possible in the higher-level domain, where memory is not represented directly. Also, this representation is time-aware, in that scheduling occurs at this level. Next, the Glow instructions are further lowered to the representation of the LLVM compiler. Up to this point every representation is hardware-independent, such that the final step is to hand over that lowered representation to the LLVM compiler, which does its magic and generates optimized machine code for the desired target architecture.
Next, a couple of words about how to use this technology. The Glow compiler technology can be used through command-line front-end tools, which are available in the Windows environment. We have the so-called model compiler, which is used to compile a model for a given target device. Next is the model profiler, which is used to quantize the model before compiling in order to achieve a better memory footprint and performance. Optionally, we also have the model tuner, which can be used to optimize the quantization process for better accuracy.
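
The lowering step in the compilation pipeline described above can be pictured with a small sketch. It assumes a hypothetical mini-IR: the `Node` and `Instruction` classes and the `lower` function are invented for illustration and are not Glow's actual data structures. The point is only that a high-level node carries no addresses, while the lowered instructions refer to explicit buffer offsets, which is what makes memory planning possible at that level.

```python
# Hypothetical mini-IR for illustration; not Glow's actual data structures.
from dataclasses import dataclass

@dataclass
class Node:
    """High-level operator: no notion of memory yet."""
    op: str
    size: int          # number of output elements

@dataclass
class Instruction:
    """Low-level operation that reads and writes buffers by offset."""
    op: str
    src: int           # offset of the input buffer
    dst: int           # offset of the output buffer
    size: int

def lower(nodes, input_size):
    """Break each node into instructions and assign buffer offsets."""
    instructions = []
    src, next_free = 0, input_size
    for node in nodes:
        # Assumption for this sketch: a fully connected node is split
        # into a matrix multiply followed by a bias add.
        if node.op == "fully_connected":
            parts = ["matmul", "add_bias"]
        else:
            parts = [node.op]
        for part in parts:
            dst = next_free
            instructions.append(Instruction(part, src, dst, node.size))
            next_free += node.size
            src = dst
    return instructions, next_free

graph = [Node("fully_connected", 4), Node("relu", 4)]
instructions, total = lower(graph, input_size=8)
```

This sketch gives every result a fresh buffer. Once offsets are explicit like this, a compiler can go further and reuse buffers whose contents are no longer needed, which is the kind of memory optimization the higher-level graph cannot express.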
Glow can import models directly from formats like Caffe2, ONNX, and TensorFlow Lite, or indirectly by using converters to bring the model to a format directly supported by Glow. When the model is compiled, Glow basically generates a bundle, which is a collection of the following artifacts: a .o file, which is a self-contained binary object file with no operating system requirements or other runtime dependencies; a .h file, which is an auto-generated header file that exposes the library API, that is, the interface; a .weights.bin file, which contains the model weights serialized in binary format; and a .weights.txt file, which contains the same weights, but this time serialized in text format, which allows including the model weights in an application in a simple way.
Apart from using ahead-of-time compilation to improve the memory footprint, we also have a mechanism called quantization, which is used to further improve the memory footprint. The quantization process consists of transforming the model to use small integer operations, for example 8-bit integer operations, instead of the usual 32-bit floating-point operations, thus achieving a 4x memory compression. From the user perspective, in order to compile a floating-point model without quantization, one would simply use the model compiler command-line tool, as suggested on the left half of the slide, which is provided with the model and options in order to produce the bundle. On the other hand, in order to quantize the model, one would first need to use the model profiler command-line tool, which is provided with the model file and some sample inputs, to produce a profile information file in YAML format, which contains all the required information to properly quantize the model. Only after that would one use the model compiler command-line tool, which is provided with the model file and the profile information previously obtained from the model profiler, in order to produce the final bundle.
Next, we'll talk about the actual optimizations. Up to this point everything was generic; it was basically Glow out of the box. So, a couple of words on what we did in order to optimize Glow for the best use of CMSIS-NN, and a couple of words about the library itself. CMSIS-NN is a collection of kernels which are developed to maximize performance by using specialized hardware instructions like SIMD, that is, single instruction, multiple data, which are instructions capable of processing multiple data operands at once, and also to minimize the memory footprint, mostly by using 8-bit integer operations. It is developed for Arm Cortex-M processors and includes kernel implementations for the most common operators found in neural networks, like convolution, max pooling, and average pooling. But it is important to note that it is not all-encompassing, meaning that CMSIS-NN does not provide support for all the operators which can be found in a neural network.
What we did is basically integrate the CMSIS-NN library within Glow. Glow, on one hand, has its own library of generic kernels, which are not accelerated for any particular hardware, so the out-of-the-box performance of Glow is lower, especially for Arm Cortex-M devices. We integrated the CMSIS-NN library within Glow's internal graph optimization pipeline using the following logic: if an operator within the initial model graph is supported by CMSIS-NN, then Glow creates a function call to one of the CMSIS-NN kernels. If it is not, then the default kernel implementation is used from Glow's internal generic kernel library. This brings the following benefits: the operators within the graph that are supported by CMSIS-NN are specialized, resulting in better performance, while the operators which are not supported by CMSIS-NN are compiled using the default Glow implementations, ensuring that the model can still be compiled and executed.
Okay. So I know that was a condensed version of the Glow technical details. But now let me just get to the real punchline here: what benefits does Glow really bring? You can see in this slide, again using the CIFAR-10 benchmark, a comparison of the optimized Glow with CMSIS-NN integrated against the TensorFlow Lite implementation. Mind you that TensorFlow Lite is also being progressively improved, so these numbers will change over time and probably have already changed since the development of this presentation. But suffice it to say that there is a significant performance increase from using the compiled version, Glow with the optimizations from the CMSIS-NN backend, versus a runtime. You can see right off the bat that you get a 3x increase using this RT1060. Now, if you also scale up to the RT1170 with a 1 gigahertz core, you get a linear improvement in the performance. So, a pretty substantial increase.
Where you also see benefits, as Ciprian pointed out, due to the quantization factor, is the difference between memory sizes. You can see that the memory utilization for Glow is two orders of magnitude lower than with TensorFlow Lite. Now, mind you that the benchmark used for TF Lite was also quantized, so we are more or less doing an apples-to-apples comparison, but you can see that, because of the overhead of the runtime engine versus the compiled model of Glow, the memory savings are very substantial. So on that note, I want to thank you very much for listening to our presentation. If you have any questions, you can contact us directly. Thank you again.
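
To make the quantization step from the talk concrete, here is a minimal sketch of mapping 32-bit floating-point values to 8-bit integers with a scale and an offset derived from a profiled value range. The helper names (`choose_params`, `quantize`, `dequantize`) are hypothetical, and this generic scheme is an assumption for illustration, not necessarily the exact scheme the Glow tools apply.

```python
# Generic scale-and-offset int8 quantization sketch.
# Helper names are hypothetical; this is not the Glow tools' implementation.

def choose_params(min_val, max_val):
    """Map the profiled float range [min_val, max_val] onto int8 [-128, 127]."""
    scale = (max_val - min_val) / 255.0
    offset = round(-128 - min_val / scale)
    return scale, offset

def quantize(values, scale, offset):
    """Convert floats to int8 values, clipping to the representable range."""
    return [max(-128, min(127, round(v / scale) + offset)) for v in values]

def dequantize(q_values, scale, offset):
    """Recover approximate float values from their int8 representation."""
    return [(q - offset) * scale for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, offset = choose_params(min(weights), max(weights))
q = quantize(weights, scale, offset)
restored = dequantize(q, scale, offset)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
float_bytes = len(weights) * 4
int8_bytes = len(weights) * 1
```

The last two lines are the arithmetic behind the 4x figure: each weight shrinks from four bytes to one, at the cost of a small rounding error bounded by the step size `scale`. The profiling pass exists to find a good `min_val` and `max_val` for each tensor.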