Worked with an extensive set of neural network architectures, such as CNNs, RNNs and GANs, on problems including reader interest prediction, sentiment analysis, visual question answering using Dynamic Memory Networks, and tissue segmentation for ultrasound images (mostly using TensorFlow and PyTorch). Also developed feature design techniques based on probabilistic models (Online Latent Dirichlet Allocation).
About the talk
Cloud TPU Pods are machine learning supercomputers that enable enterprises to train large machine learning models at a scale that is otherwise difficult or sometimes impossible. They deliver business value by solving many machine learning problems, including image search, neural architecture search and large-scale language models, both for Google and for Google Cloud customers. However, until a few months ago, TPU devices and Pods were limited to TensorFlow models. Much-awaited support for PyTorch was announced at the PyTorch conference in 2019, and the list of models proven on TPUs has continued to grow ever since. In this session we demystify the art and science of training your PyTorch models on TPUs.
Speakers: Taylan Bilal, Vaibhav Singh
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
product: Cloud TPU; fullname: Taylan Bilal, Vaibhav Singh;
event: Google Cloud Next 2020; re_ty: Publish;
Hello, my name is Vaibhav Singh. I'm a machine learning specialist working with Google Cloud. And here is Taylan Bilal, a software engineer in Google Cloud, working on TPUs. Today we are going to talk about PyTorch at scale on Cloud TPUs. So let's get started. Here is our roadmap for the talk. We are going to start with an introduction. Then we'll move to a step-by-step walkthrough of an example. We will discuss how exactly it works and how to scale your PyTorch training on TPU Pods.
We are going to leave you with resources so that you can start training on TPUs today. PyTorch is officially supported on Cloud TPUs. This support is made possible through the PyTorch/XLA library, an open source initiative from Facebook and Google. It was announced in October 2019 at the PyTorch conference and has been growing ever since. The support reached beta with the PyTorch 1.5 release, and is expected to be generally available by the time you view this recording.
For those of you who are not very familiar with TPUs: a TPU, or Tensor Processing Unit, is a hardware accelerator built by Google. Launched in 2015, the very first generation of TPUs was an inference-only chip. With version two, it evolved to support training as well. And beginning with version two, Google also made TPUs available to the rest of the world through Google Cloud, with the vision to help push the machine learning state of the art forward. You can scale your machine learning workloads using TPUs from 8 cores all the way to a full TPU Pod, which consists of thousands of TPU cores. As you can see
on the slide, Cloud TPU comes in two different flavors, and it scales from the single device, which you can see in the top left corner of your screen, all the way to the full Pod. A v2 Pod comes with 512 cores, and a v3 Pod comes with 2048 cores. You can choose a slice of a Pod, all the way up to the full Pod, for your machine learning training. Here is a Cloud TPU device. In order to access this device, a user creates a VM, a virtual machine on Google Cloud, and then creates an instance of Cloud TPU. You can create this instance in
different device types. In this case, we are choosing v3-8. A v3-8 device comes with four chips. Each of these chips has got two cores, and each of these cores comes equipped with matrix multiplication units and 16 GB of high-bandwidth memory per core. All in all, this device has got eight cores, and that is where the 8 in v3-8 comes from. In the case of a Cloud TPU v3 Pod, a group of 256 TPU devices gives you 2048 cores, connected in a way that lets the gradient averaging, or all-reduce, operation run very efficiently on the full Pod,
and you can go all the way from a TPU device to the Pod, or any intermediate size, without any code changes. It always acts as a single node, without any complicated setup. Now, let's take a look at an example of PyTorch code as we know it, and then see how the code needs to be modified, or what modifications you typically see, in order to run that PyTorch code on Cloud TPUs. Let's begin with the familiar. For those of you who are familiar with distributed training on PyTorch, you have already seen
similar examples, with elements such as the distributed library and the model being transferred to the devices. In this case, the devices are GPUs. You see familiar elements like the training loop. You see the forward pass within the training loop, and the backward pass, which computes the gradients. You see the optimizer step, which is going to update the parameters based on those gradients. This is a pretty good example of replicas of this model running in parallel. Now let's see how this
code changes when we are going to train it on Cloud TPUs. As I was describing, a Cloud TPU device consists of four TPU chips with eight cores. You start with elements like the torch_xla library. This is what makes training a PyTorch model on TPU possible. Similar to what you saw in the case of GPU, we have another device abstraction, called the XLA device. This is an abstraction of the Cloud TPU. Notice how you were transferring your model onto the GPU device. Here you are transferring
your model onto the XLA device. And finally, elements like the distributed sampler and the data loader remain exactly the same, with one difference: we are using a ParallelLoader, a wrapper that makes sure that the data upload operations to the TPUs run in parallel with the computations, in order to make the data pipeline efficient. One more important distinction to highlight: you are using the optimizer_step method from the XLA model module, rather than the optimizer's own step call from the previous example. This is to make sure that the all-reduce operation is performed first.
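A minimal sketch of the TPU training function described so far, assuming the torch_xla Python API roughly as it stood around the 1.5 release (`xm.xla_device`, `pl.ParallelLoader`, `xm.optimizer_step`, `xm.xrt_world_size`, `xm.get_ordinal`). The linear model, the random dataset and the `FLAGS` values are hypothetical placeholders, not the code shown on the slide. The imports sit inside the function so that the file can still be loaded on a machine without torch_xla installed.

```python
# Hypothetical hyperparameters, for illustration only.
FLAGS = {
    "in_features": 32,
    "num_classes": 10,
    "num_samples": 1024,
    "batch_size": 64,
    "lr": 0.01,
    "epochs": 2,
}


def train_fn(index, flags):
    """Training function meant to run once per TPU core; `index` is the process index."""
    # Lazy imports: torch and torch_xla are only needed when training actually runs.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()  # XLA device abstraction instead of a CUDA device

    # Placeholder model and random data (assumptions, not from the talk).
    model = nn.Linear(flags["in_features"], flags["num_classes"]).to(device)
    dataset = TensorDataset(
        torch.randn(flags["num_samples"], flags["in_features"]),
        torch.randint(0, flags["num_classes"], (flags["num_samples"],)),
    )

    # Same distributed sampler and data loader as in the GPU version.
    sampler = DistributedSampler(
        dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()
    )
    loader = DataLoader(
        dataset, batch_size=flags["batch_size"], sampler=sampler, drop_last=True
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=flags["lr"])
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(flags["epochs"]):
        sampler.set_epoch(epoch)
        # ParallelLoader overlaps host-to-TPU data upload with computation.
        para_loader = pl.ParallelLoader(loader, [device])
        for data, target in para_loader.per_device_loader(device):
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            # All-reduces the gradients across replicas, then steps the optimizer.
            xm.optimizer_step(optimizer)
```

The function takes the process index as its first argument because it is written to be started once per core rather than called directly.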
The all-reduce has to happen before the optimizer step is called to update the parameters. And finally, instead of the torch multiprocessing module, you have the XLA multiprocessing module, which has a similar spawn method. Under the hood, it takes care of all the necessary TPU device initialization and setup before using the torch multiprocessing module to start the processes in a similar fashion. This starts the process of running training on eight different replicas of your model, in this case because we are running on eight cores. Also notice that the start method is used to make sure that the model is
copied into memory only once. Each of the child processes will be able to use all the resources that were spun up before that in the global scope. Now, let's look at a live example of the training we are going to do, on a ResNet-18 model, using the CIFAR-10 dataset. I start with the Cloud Console. I have set up an AI Platform Notebooks instance, with the PyTorch/XLA library and all the necessary components ready to go. So I start my notebook instance. In the meantime, notice also that I have created
a Cloud TPU instance. That is a v3-8, with eight cores. The internal IP is displayed in this column whenever you create a device, and that internal IP is used to set up the XRT_TPU_CONFIG environment variable before you start the training. This is required for the user VM, or your notebook, to be able to communicate with the TPU instance. Now let's get to the demo, where I train the ResNet-18 model on CIFAR-10. As mentioned earlier, we start by setting the XRT_TPU_CONFIG environment variable, and we define some parameters and do some imports. We are using the ResNet-18 model
from the torchvision library. Then we define the training loop. Notice that, apart from the train loop function from the example walkthrough, we also have a test loop function defined, for tracking the results at the end of each epoch. This is the function that is going to call the training method. It is launched using spawn, with this function as one of the arguments, then passing the other flags, defining the number of cores and the start method, as described in the walkthrough. And then we start.
This training is starting to run. In the meantime, let's look at another instance of this notebook, which has this whole training execution finished. You will notice that we train for 20 epochs. That is going to take a few minutes to finish. After 20 epochs, we are close to 79%. You can also notice that it is possible to track things like images per second and other metrics, as you would do in any other PyTorch code. We are getting almost
76% accuracy, and then we are ready to visualize the results. These are the results here, and we can see how the model does on most of the examples. Now that you are familiar with PyTorch/XLA usage, you may be wondering how exactly it works. If you are familiar with PyTorch tensors, or ATen, you already know the eager implementation. The XLA tensor is different from other tensors in that it is a lazy tensor. Eager tensors are dispatched op by op, and the only computation graph that is generated is for the gradient computation; as soon as the computation of the
backward pass is done, that graph is destroyed. With XLA tensors, no evaluation is performed on the graph until it is required, either by a certain operation in your model code itself or by a call that necessitates a context transfer from tensor to scalar value. This is what enables PyTorch/XLA to do certain graph-level optimizations, which are not possible if you are doing op-by-op dispatch. The graph is then compiled and sent to the TPU device. This compilation is not performed again unless the graph is going to change, for example because of the nature of the code itself. Now I
invite Taylan to go into more detail on the things to watch out for and how you can make the most of your PyTorch code on TPUs. Over to Taylan for tips and tricks. All right, thanks, Vaibhav. So let me talk about some of the tips and tricks, things to watch out for when you port your model from GPU code to TPU code. One of the most common pitfalls is dynamic input shapes. Every time the intermediate representation graph changes, there is a recompilation, and that slows things down. This happens when the model has input shapes
that change from one step to another. A good example is an NLP model with sentences of varying lengths, or, for example, some models that work with images. The second pitfall is what we call the context change between the device and the CPU. The most common instance of that is when the code has tensor item() calls. This slows things down because, as Vaibhav described, XLA is built on the lazy tensor idea, and an item() call prematurely causes evaluation. The third pitfall is when the model code uses some operation that has no XLA lowering yet. That also causes a context change: the operation that is not lowered to XLA gets executed on the CPU, which slows things down. We want to stay in the device context as much as possible.
When these things happen, we have solutions. For the first problem, dynamic shapes, use padding, and pad sentences to a fixed length. Or maybe we can pad to a few different lengths; we will still have recompilation, because the input shapes will take three or four different values, and the first time each of them is seen we are going to
compile, but we are going to waste fewer FLOPs as a result of less padding. So there is a trade-off there. Second, we have the tensor-to-scalar context change, the device-to-CPU exchange. This is going to happen because of, for example, loss reporting, since the loss is usually a tensor. We recommend reducing the item() calls: do not report the loss every step, and instead report it every so many steps. Whether to take a value of 20, 30 or a hundred is up to the user. If you also have item() calls in another part of the model, they need to be spotted and eliminated. In the third case, if we encounter an op with no XLA lowering,
we usually just go to GitHub and open an issue asking the PyTorch/XLA team to lower this op, so that the operation runs on the device and is accelerated as a result. We have talked about problems and solutions. How do you actually understand what is going on and what the problem actually is? This is the slide from earlier, and we want to focus on the white rectangles. Here we are printing the metrics report. The metrics report contains a lot of information about what is going on. It surfaces the issues that I just discussed on the previous slide.
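Going back to the first tip, the trade-off between one fixed length and a few padding lengths can be sketched in plain Python. The bucket lengths, `PAD_ID` and the helper name `pad_to_bucket` are hypothetical choices made for illustration; real code would apply the same idea to tensors before they reach the device.

```python
from bisect import bisect_left

# Hypothetical bucket lengths and padding token id, for illustration only.
BUCKETS = (16, 32, 64, 128)
PAD_ID = 0


def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad a sequence to the smallest bucket that fits it.

    Sequences longer than the largest bucket are truncated to it, so the
    output length is always one of `buckets`.
    """
    position = bisect_left(buckets, len(token_ids))
    target = buckets[min(position, len(buckets) - 1)]
    clipped = list(token_ids[:target])
    return clipped + [pad_id] * (target - len(clipped))


# Sentences of many different lengths collapse to a handful of shapes,
# so a graph is compiled at most once per bucket.
sentences = [[1] * n for n in (3, 15, 16, 17, 40, 90, 128, 300)]
padded_lengths = sorted({len(pad_to_bucket(s)) for s in sentences})
print(padded_lengths)  # prints [16, 32, 64, 128]
```

Fewer buckets mean fewer compilations but more wasted padding; more buckets mean the opposite, which is the trade-off mentioned above.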
Let's look at an example of a metrics report. This is just a summary that focuses on the issues that we discussed; the actual metrics report contains more information than this. If you look at the first part, the CompileTime metric gives information about how many compilations have occurred in this workload. In this case, 202 compilations occurred. There are metrics and there are counters. Counters give information about what is happening in the code. For example, the MarkStep counter tells you how many training steps
we have taken. In this case, that is 232 steps and 202 compilations, which suggests that there is a problem with the graphs the code is generating. The number of compilations should be low: some compilation initially, and it should not grow as the number of steps grows. For the item() call problem, the aten::_local_scalar_dense counter will be your friend. In this case, you have 240-some of those, and that is something that needs to be reduced to get optimal performance. In the third problem that we discussed,
the op has not been lowered yet. You will see an aten:: counter with the op's actual name. In this case, we have one at 232, probably once per step, so that op needs to be lowered to XLA. Once you iron those issues out, the model will train much faster; it will be very performant. So far, we have seen how training works on a Cloud TPU device. Now, let's discuss how to take it to the next level and train on a Cloud TPU Pod. As Vaibhav described, and as you probably heard, a Cloud TPU Pod consists of
several devices connected. In this context, we create as many user VMs as there are Cloud TPU devices, not cores. So, for example, for a Pod slice made up of four devices, we would need four VMs to drive the training. Once we have set up these resources, launching the training in distributed mode is very easy. Now, let's look at the command to do that. Here, the white part of the large command is the actual training command to run on one device. So, this is your code. Scaling up to the Pod level is just
a matter of prefixing the command with these parts. The green part is xla_dist, and this coordinates the resources, sets up the TPUs, and sets up the environment variables specified by the next set of elements, for example the TPU name, the conda environment, and any other environment variable that you might want to set. And that's it. There are really no code changes. One gotcha here is that the code needs to be written so that the data is sharded across all the cores; otherwise the VMs will end up doing the same work.
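The sharding gotcha can be illustrated with a small plain-Python sketch. The helper name `shard_indices` and the sizes are hypothetical; in PyTorch code this job is usually done by `torch.utils.data.distributed.DistributedSampler` when it is given `num_replicas` and `rank`, which, unlike this sketch, also pads the shards to equal length by default.

```python
def shard_indices(num_samples, num_replicas, rank):
    """Return the sample indices that replica `rank` should read."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must satisfy 0 <= rank < num_replicas")
    # Strided split: replica r takes samples r, r + num_replicas, ...
    return list(range(rank, num_samples, num_replicas))


# Hypothetical sizes: 4 devices x 8 cores = 32 replicas, 100 samples.
NUM_SAMPLES = 100
NUM_REPLICAS = 32
shards = [shard_indices(NUM_SAMPLES, NUM_REPLICAS, r) for r in range(NUM_REPLICAS)]

# Every sample is read by exactly one replica, so no two cores repeat work.
seen = sorted(i for shard in shards for i in shard)
print(seen == list(range(NUM_SAMPLES)))  # prints True
```

If every replica used the same rank, each core would read the same shard, which is the duplicated work the talk warns about.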
That's really the only gotcha, and scaling up is as easy as just prefixing your command. Now, let's look at some of the models that have actually been tested and are known to work. In the computer vision area, we have ResNet, along with most torchvision models, such as MobileNets and others, that already work. We have working implementations in the Transformer area, and in the speech recognition world as well. Let me note here that the models that work are not limited to these. These are the ones that have been tested,
and other models run successfully as well, so you can port your model too. To get started, here are some resources. These are some of the tutorials that are offered on the Google Cloud web pages, covering the models that I just discussed on the previous slide. Along with the Google Cloud tutorials, there are some from Kaggle experts in the community, who have successfully run competitions on PyTorch/XLA and offer their tutorials to the users.
And finally, here is a summary slide of all the resources that we saw: some of the GitHub repositories that are relevant to the project, the official tutorials, and the Slack channels. Please visit the active open source library, and open an issue if you encounter any problems. Thanks for listening.