Priya Gupta is a software engineer on the TensorFlow team at Google. She is working on making it easier to run TensorFlow in a distributed environment. She is passionate about technology and education, and wants machine learning to be accessible to everyone. Prior to joining TensorFlow, she worked at Coursera, and on the Mobile Ads team at Google.
Anjali is a member of the TensorFlow team at Google. She enjoys working on problems at the intersection of large scale distributed systems and machine learning. She has previously worked on developing cloud services at AWS and Google.
About the talk
To efficiently train machine learning models, you will often need to scale your training to multiple GPUs, or even multiple machines. TensorFlow now offers rich functionality to achieve this with just a few lines of code. Join this session to learn how to set this up.
My name is Priya, and along with Anjali, we're software engineers on the TensorFlow team working on distributed TensorFlow. We're so excited to be here today to tell you about distributed TensorFlow training. Let me grab the clicker. Okay. Hopefully most of you know what TensorFlow is: it's an open-source machine learning framework used extensively both inside and outside Google. For example, if you've tried the Smart Compose feature that was launched a couple of days ago, that feature uses TensorFlow.
TensorFlow allows you to build, train, and make predictions with neural networks such as this one. In training, we learn the parameters of the network using data. Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see the training time on the x-axis and the accuracy of predictions on the y-axis. This is taken from training an image recognition model on a single GPU. As you can see, it took more than 80 hours to get to 75% accuracy.
If you have some experience running complex machine learning models, this might sound rather familiar to you, and it might make you feel something like this. If your training takes only a few minutes to a few hours, you'll be productive and happy, and you can try out new ideas faster. If it starts to take a few days, maybe you can still manage and run a few things in parallel. When it starts to take a few weeks, your progress slows down, and it becomes expensive to try out every new idea. And if it starts to take more than a month, it's not even worth thinking about. This is not an exaggeration: training complex models, such as the ResNet50 we'll talk about later in the talk, can take up to a week on a single but powerful GPU like a Tesla P100.
So a natural question to ask is: how can we make training fast? There are a number of things you can try. You can use a faster accelerator, such as a TPU, or Tensor Processing Unit; I'm sure you've heard all about them in the last couple of days here. Your input pipeline might be the bottleneck, so you can work on that and make it faster; there are a number of guidelines on the TensorFlow website that you can follow to improve the performance of your training. This talk will focus on distributed training, that is, running training in parallel on multiple devices, such as CPUs, GPUs, or TPUs, in order to make your training faster. With the techniques we talk about in this talk, you can bring down your training time from weeks to hours, with just a few lines of code and a few powerful GPUs. In the graph here, you can see the images per second processed while training an image recognition model. As you can see, as we increase the number of GPUs from one to two to four to eight, the images per second processed almost doubles every time. We'll come back to these performance numbers later with more details. Before diving into the details of how you can get that kind of scaling in TensorFlow, I first want to cover a few high-level concepts and architectures in distributed training. This will give us a strong foundation with which to understand the various solutions.
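To make the scaling claim concrete, here is a tiny back-of-the-envelope calculation. The throughput numbers below are hypothetical placeholders, not the talk's measurements; the point is only how speedup and scaling efficiency are derived from the images-per-second figures:

```python
# Hypothetical throughput figures (images/sec) illustrating near-linear
# scaling as the GPU count doubles; the talk's exact numbers differ.
throughput = {1: 210, 2: 400, 4: 780, 8: 1500}

for n_gpus, imgs_per_sec in throughput.items():
    speedup = imgs_per_sec / throughput[1]       # relative to a single GPU
    efficiency = speedup / n_gpus                # 1.0 would be perfect scaling
    print(f"{n_gpus} GPU(s): {imgs_per_sec} img/s, "
          f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

Efficiency below 100% is expected: gradient synchronization and input-pipeline overheads grow with the number of devices.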
As our focus is on training today, let's take a look at what a typical training loop looks like. Let's say you have a simple model like this, with a couple of hidden layers. Each layer has a bunch of weights and biases, also called the model parameters, or trainable variables. A training step begins with some processing on the input data. We then feed this input into the model and compute the predictions in the forward pass. We then compare the predictions with the input labels and compute the loss. Then, in the backward pass, we compute the gradients, and finally, we update the model parameters using these gradients. This process is known as one training step, and the training loop repeats this training step until you reach the desired accuracy.
Let's say you begin your training on a simple machine under your desk with a multi-core CPU. TensorFlow handles scaling onto a multi-core CPU automatically. Next, you may speed things up by adding an accelerator to your machine, such as a GPU or a TPU. With distributed training, you can go even further:
you can go from one machine with a single device, to one machine with multiple devices, and finally to multiple machines with possibly multiple devices each, connected over a network. With a number of techniques, it's eventually possible to scale to hundreds of devices, and that's indeed what we do in a lot of Google systems. By the way, in the rest of this talk, we'll use the terms device, worker, or accelerator to refer to processing units such as GPUs or TPUs.
So how does distributed training work? Like everything else in software engineering, there are a number of ways to go about it when you think about distributing your training. Which approach you pick depends on the size of your model, the amount of training data you have, and the available devices. The most common architecture in distributed training is what is known as data parallelism. In data parallelism, we run the same model and computation on each worker, but with a different slice of the input data. Each device computes the loss and the gradients, uses these gradients to update the model parameters, and the updated model is then used in the next round of computation.
There are two common approaches when you think about how to update the model using these gradients. The first is what is known as the asynchronous parameter server approach. In this approach, we designate some devices as parameter servers, shown in blue here; these servers hold the parameters of your model. Others are designated as workers, shown in green here.
The workers do the bulk of the computation. Each worker fetches the parameters from the parameter server, then computes the loss and gradients. It sends the gradients back to the parameter server, which then updates the model's parameters using these gradients. Each worker does this independently, so this approach scales to a large number of workers. This has worked well for many models in Google, where training workers might be preempted by high-priority production jobs, where there is asymmetry between the different workers, or where machines might go down for regular maintenance. And none of this hurts the scaling, because the workers are not waiting on each other. The downside of this approach, however, is that workers can get out of sync: they may compute gradients on stale parameter values, and this can delay convergence.
The second approach is what is known as the synchronous all-reduce approach. This approach has become more common with the rise of fast accelerators such as TPUs or GPUs. In this approach, each worker has its own copy of the parameters; there are no special parameter servers. Each worker computes the loss and gradients based on a subset of the training samples. Once the gradients are computed, the workers communicate among themselves to propagate the gradients and update the model parameters. All the workers are synchronized, which means that the next round of computation doesn't begin until each worker has received the updated gradients and updated its model. When you have fast devices in a controlled environment, the variance of step time between the different workers can be small. Combined with strong communication links between the different devices, the overall overhead of synchronization can be small, so whenever practical, this approach can lead to faster convergence.
A class of algorithms called all-reduce can be used to efficiently combine the gradients across the different workers. All-reduce aggregates the values from the different workers, for example by adding them up, and then copies them back to the different workers. Implemented well, it can be very efficient, and it can greatly reduce the overhead of synchronizing the gradients. There are many all-reduce algorithms available, depending on the type of communication available between the different workers. One common algorithm is what is known as ring all-reduce: each worker sends its gradients to its successor on a ring and receives gradients from its predecessor. There are a few more rounds of gradient exchanges; I won't go into the details here, but at the end of the algorithm, each worker has received a copy of the combined gradients. Ring all-reduce uses network bandwidth optimally, because it uses both the upload and download bandwidth at each worker. It can also overlap the gradient computation at the lower layers of the network with the transmission of gradients at the higher layers, which means it can further reduce training time. Ring all-reduce is just one approach; some hardware vendors supply specialized implementations of all-reduce for their hardware, for example NVIDIA's NCCL. We have teams at Google working on fast implementations of all-reduce for various device topologies. The bottom line is that all-reduce can be fast when working with multiple devices on a single machine, or a small number of machines.
So, given these two broad architectures in data parallelism, which approach should you pick? There isn't one right answer. The asynchronous parameter server approach is preferable if you have a large number of not-so-powerful or not-so-reliable machines, for example a large cluster of machines with just CPUs. The synchronous all-reduce approach, on the other hand, is preferable if you have fast devices with strong communication links, such as TPUs or multiple GPUs on a single machine. The parameter server approach has been around for a while and has been supported well in TensorFlow; the synchronous all-reduce approach now has out-of-the-box support as well. In the next section of this talk, we'll show you how you can scale your training using the all-reduce approach on multiple GPUs with just a few lines of code.
Before I get into that, I just wanted to mention another type of distributed training, known as model parallelism, that you may have heard of. A simple way to think about model parallelism: when your model is so big that it doesn't fit in the memory of one device, you divide the model into smaller parts and do those computations on different workers, with the same training samples. For example, you could put different layers of your model on different devices. In practice, however, most devices have big enough memory that most models can fit, so the rest of this talk will continue to focus on data parallelism.
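The synchronous data-parallel scheme described above can be sketched end to end in plain Python. This is a toy simulation, not TensorFlow code: a hypothetical one-parameter linear model is fit on four simulated workers, and the per-worker gradients are combined with a naive ring all-reduce (real systems use optimized implementations such as NVIDIA's NCCL):

```python
# Toy sketch of synchronous data-parallel training: fit y = 3x with a
# one-parameter model on 4 simulated workers, each holding a data shard.

def grad(w, shard):
    """Mean gradient of the squared error 0.5*(w*x - y)**2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def ring_all_reduce(values):
    """Naive ring all-reduce: after n-1 hops around the ring, every worker
    holds the sum of all workers' values (hops are simulated with list
    rotations instead of real network transfers)."""
    n = len(values)
    totals = list(values)
    for _ in range(n - 1):
        totals = [totals[i] + values[(i - 1) % n] for i in range(n)]
        values = [values[(i - 1) % n] for i in range(n)]
    return totals

# Data generated from y = 3x, split into one shard per worker.
shards = [[(x, 3.0 * x) for x in range(s, 20, 4)] for s in range(4)]

w = 0.0
for _ in range(100):                             # synchronous training steps
    grads = [grad(w, shard) for shard in shards]     # in parallel, in reality
    summed = ring_all_reduce(grads)              # every worker gets the sum
    w -= 0.01 * summed[0] / len(shards)          # update with the mean gradient

print(round(w, 3))  # converges to 3.0
```

Because every worker applies the same averaged gradient, all model copies stay identical after each step, which is exactly the invariant the synchronous approach maintains across devices.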
Now that you're armed with the fundamentals of distributed training architectures, let's see how you can do this in TensorFlow. As I already mentioned, we're going to focus on scaling to multiple GPUs. To let you do so easily, we've introduced to TensorFlow the new distribution strategy API. This allows you to distribute your training with very little modification to your code. With the distribution strategy API, you no longer need to place your ops or parameters on specific devices, or worry about aggregating the gradients and losses correctly across devices; distribution strategy does that for you. It is easy to use, and fast to train.
Now, let's look at some code to see how you can do this with the TensorFlow API. In our example, we're going to be using TensorFlow's high-level API called Estimator. If you've used this API before, you might be familiar with the following snippet of code to create a custom estimator. It requires three arguments. The first is a function that defines your model: the parameters of your model, how you compute the loss and the gradients, and how to update the model's parameters. The second argument is a directory where you want to persist the state of your model. And the third argument is a configuration called RunConfig, where you can specify things like how often you want to checkpoint, how often your summaries should be saved, and so on; in this case, we use the default RunConfig. Once you've created the estimator, you can start your training by calling the train method with an input function that provides your training data.
So, given this code to do the training on one device, how can you change it to run on multiple GPUs? You simply need to add one line of code: instantiate something called MirroredStrategy and pass it to the RunConfig call. That's it. Those are all the code changes you need to scale this code to multiple GPUs. MirroredStrategy is one type of the distribution strategy API that I just mentioned. With this API, you don't need to make any changes to your model function, your input function, or your training loop. You don't even need to specify your devices: if you want to run on all available devices, it will automatically detect that and run your training on all available GPUs. That's it, those are all the code changes you need. It is available in tf.contrib now, and you can try it out today.
Let me quickly talk about what MirroredStrategy does. MirroredStrategy implements the synchronous all-reduce architecture that we talked about, out of the box for you. In MirroredStrategy, the model's parameters are mirrored across the various devices; hence the name MirroredStrategy.
Each device computes the loss and gradients based on a subset of the input data. The gradients are then aggregated across the workers using an all-reduce algorithm that is appropriate for your device topology. As I already mentioned, with MirroredStrategy you don't need to make any changes to your model or your training loop, because we've changed the underlying components of TensorFlow to be distribution-aware: for example, optimizers, batch norm, summaries, and so on. You don't need to make any changes to your input function either, as long as you're using the recommended TensorFlow dataset API. Saving and checkpointing work seamlessly, so you can save with one or no distribution strategy and resume with another. And summaries work as expected as well, so you can continue to visualize your training in TensorBoard. MirroredStrategy is just one type of distribution strategy, and we're working on a few others for a variety of use cases. I'll now hand it over to Anjali to show you some cool demos and performance numbers.
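As a rough sketch, the code pattern described in this section looks like the following. This assumes the TensorFlow 1.x-era contrib API discussed in the talk; `model_fn` and `train_input_fn` are placeholders for your own model and input functions, and exact module paths may differ across versions:

```python
import tensorflow as tf  # TensorFlow 1.x, where the contrib API lived

def model_fn(features, labels, mode):
    """Placeholder: defines the model, the loss, and the train op."""
    raise NotImplementedError

def train_input_fn():
    """Placeholder: returns a tf.data.Dataset of (features, labels)."""
    raise NotImplementedError

# The one extra line: a MirroredStrategy passed in via RunConfig.
distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)

classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="/tmp/model_dir",   # where checkpoints and summaries persist
    config=config)
classifier.train(input_fn=train_input_fn)
```

Without the `train_distribute` argument, the same code trains on a single device; with it, the estimator mirrors the variables and runs the synchronous all-reduce loop across all available GPUs.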
Thanks, Priya, for the great introduction to MirroredStrategy. For the demo, I'm going to be running the ResNet50 model from the TensorFlow model garden. ResNet50 is an image classification model that has 50 layers and uses skip connections for efficient gradient flow. The TensorFlow model garden is a repo with a collection of different models written with high-level APIs. I'm going to be using the ImageNet dataset as input to the model training: images that have been categorized into a thousand labels. I'm going to instantiate an n1-standard instance on GCE and attach NVIDIA Tesla V100 GPUs.
Let's run the demo now. As I mentioned, I'm creating an n1-standard instance and attaching eight NVIDIA Tesla V100 GPUs. I also attach an SSD disk; this disk contains the ImageNet dataset, which is the input to our model training. To run the TensorFlow model, we need to install a few drivers and packages, and here is a gist with all the commands required. I'm going to make this gist public, so you can set up an instance yourself and try running the model.
Let's open an SSH connection to the instance by clicking on the button here. This should bring up a terminal like this. I've already cloned the garden model repo directory; we're going to run the ResNet main file. We're using the ImageNet dataset, with a batch size of 1024, or 128 per GPU. The model directory points to the GCS bucket that will hold the checkpoints and summaries we want to save. We point our data directory to the SSD disk, which, as I mentioned, holds the ImageNet dataset. And the number of GPUs is 8, over which we want to distribute our training.
So let's run this model now, and as the model is starting to train, let's look at the model function. This is the ResNet main function in the garden model repo. First, we instantiate the MirroredStrategy object. Then we pass it to the RunConfig via the train_distribute argument. We create an estimator object with the RunConfig, and then we call train on this estimator object. And that's it: those are all the code changes you need to distribute the ResNet model.
Let's go back and see how our training is going. It has run for a few hundred steps; at the bottom of the screen, you should see a few metrics: the loss and steps per second are displayed. Here is a run where we ran the model for 90,000 steps, which I have loaded. The orange and red lines are the training and evaluation losses; as the number of steps increases, you see the loss decreasing. Let's look at the evaluation accuracy. This is when we're training ResNet50 over 8 GPUs: you can see that at around 90,000 steps, we're able to achieve a 75% accuracy.
Let's see what it looks like when we run it on a single GPU. Let's toggle the run buttons on the left and compare the two runs. The blue line is one GPU, and the red and orange lines are eight GPUs. Comparing the evaluation accuracy of eight GPUs against one: when we run the same model for the same amount of time over multiple GPUs, we're able to achieve higher accuracy faster, or in other words, train our model faster. Here are a few performance benchmarks, run on an NVIDIA DGX-1 machine.
We ran ResNet50 with mixed precision training on eight V100 GPUs, with a per-GPU batch size of 256. The graph shows the number of GPUs on the x-axis and images per second on the y-axis. Going from one device to eight, we see a speedup of about 7x. And this is performance right out of the box, with no tuning. We're actively working on improving performance so that you can achieve even more speedup and process more images per second when you distribute your model across multiple GPUs.
So far, we've been talking about the model training computation and distributing it using MirroredStrategy. You might expect to see the same kind of boost in images per second whenever you distribute training over many GPUs, but you may not see that boost in performance, and the reason for that is often the input pipeline. When you run a model on a single GPU, the input pipeline keeps preprocessing the data and making it available for training, and that may be fine. But GPUs and TPUs, as you know, can process data much faster than CPUs, so across multiple GPUs, if the input pipeline can't keep up with the training, it quickly becomes a bottleneck.
For the rest of the talk, I'm going to show you how TensorFlow makes it easy for you to use the tf.data APIs to build efficient and performant input pipelines. Here is a simple input pipeline for ResNet50. We're going to use the tf.data API because datasets are awesome: they help us build complex pipelines from simple, reusable pieces. If you have different data formats, or you want to perform complex transformations on the data, you want to be using the tf.data APIs to build your input pipeline.
First, we're going to use the list_files API to get the list of input files that contain the images and labels. Then we're going to read these files using the TFRecordDataset reader. We're going to shuffle the records and repeat them a few times, depending on how many epochs you want to run. Finally, we apply a map transformation, which processes each record and applies transformations such as cropping, flipping, and image decoding, and then we batch the input into the batch size that you desire.
The input pipeline can be thought of as an ETL process, which is extract, transform, and load. In the extract phase, we're reading from persistent storage, which can be local or remote. In the transform phase, we're applying the different transformations like shuffle, repeat, map, and batch. And finally, in the load phase, we're providing this processed data to the accelerator for training. So how does this apply to the example that we just saw?
In the extract phase, we list the files and read them using the TFRecordDataset reader. In the transform phase, we apply the shuffle, repeat, map, and batch transformations. And finally, in the load phase, we tell TensorFlow how to grab the data from the dataset. This is what the input pipeline looks like: the extract, transform, and load phases happen sequentially, followed by the training on the accelerator. This means that when the CPU is busy processing the data, the accelerator is idle, and when the accelerator is training your model, the CPU is idle.
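In code, the sequential pipeline just described might look like this. This is a TensorFlow 1.x-era tf.data sketch; the file pattern, `NUM_EPOCHS`, `BATCH_SIZE`, and `parse_and_augment` are placeholders, not values from the talk:

```python
import tensorflow as tf

NUM_EPOCHS = 90      # placeholder values
BATCH_SIZE = 1024

def parse_and_augment(record):
    """Placeholder: decode the TFRecord, crop/flip the image, return (image, label)."""
    raise NotImplementedError

# Extract: list the input files and read the raw TFRecords.
files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files)

# Transform: shuffle, repeat for the desired epochs, parse/augment, batch.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat(NUM_EPOCHS)
dataset = dataset.map(parse_and_augment)
dataset = dataset.batch(BATCH_SIZE)

# Load: the estimator's input_fn returns this dataset, and TensorFlow
# feeds the resulting batches to the accelerator.
```

As written, every stage runs on the CPU one element after another, which is exactly the sequential behavior discussed next.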
But the different phases of the ETL process use different hardware resources. For example, the extract step uses persistent storage, the transform phase uses different cores of the CPU, and finally, the training happens on the accelerator. So if we can parallelize these different phases, then we can overlap the preprocessing of data on the CPU with the training of the model on the GPU. This is called pipelining. We can use pipelining and some parallelization techniques to build more efficient input pipelines.
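The pipelining idea can be illustrated with a small plain-Python producer-consumer sketch (no TensorFlow here; the bounded queue plays the role of tf.data's internal buffers): a background thread preprocesses batches while the main loop consumes them, so the two stages overlap instead of alternating.

```python
import queue
import threading

# Toy sketch of pipelining: a background thread "extracts and transforms"
# batches (the CPU's job) while the main loop "trains" on them (the
# accelerator's job). The two stages overlap instead of taking turns.

def producer(out_q, n_batches):
    for i in range(n_batches):
        batch = [x * x for x in range(i, i + 4)]  # stand-in preprocessing
        out_q.put(batch)                          # blocks if the buffer is full
    out_q.put(None)                               # end-of-input sentinel

prefetch_buffer = queue.Queue(maxsize=2)          # like a prefetch buffer of 2
threading.Thread(target=producer,
                 args=(prefetch_buffer, 5), daemon=True).start()

trained = []
while True:
    batch = prefetch_buffer.get()                 # next batch is often ready
    if batch is None:
        break
    trained.append(sum(batch))                    # stand-in "training step"

print(len(trained))  # → 5
```

With the bounded buffer, the producer stays at most two batches ahead, so preprocessing of step N+1 proceeds while step N is being "trained on", which is the overlap the ETL diagram describes.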
Let's look at a few of these techniques. First, you can parallelize file reading, for example across a cloud storage service; you can do this using the num_parallel_reads argument. This allows you to increase your effective throughput. You can also parallelize the map function's transformations: you can parallelize the processing of the different records through the map function by using the num_parallel_calls argument. Typically, the number you provide for this argument is the number of cores of the CPU. And finally, you want to call prefetch at the end of your input pipeline. This decouples the time the data is produced from the time it is consumed, which means you can buffer the data for the next training step while the accelerator is still training on the current step.
This is what we had before, and this is the improvement we can get. Now the different phases of the input pipeline happen in parallel with training. We can see that the CPU is preprocessing data for training step 2 while the accelerator is still training on step 1. Neither the CPU nor the accelerator is idle for long periods of time, and the step time is now the maximum of the preprocessing time and the training time on the accelerator.
As you can see, the accelerator is still not 100% utilized. There are a few advanced techniques we can add to our input pipeline to improve this. We can use fused versions of some of these transformation ops. Shuffle and repeat, for example, can be replaced by their fused equivalent op: this parallelizes buffering elements for epoch N+1 with producing elements for epoch N. We can also replace map and batch with their fused equivalent op, which parallelizes the map transformation with moving the resulting tensors into the batch. With these techniques, we're able to process data much faster, make it available to the accelerator for training, and improve the training speed. I hope this gives you a good idea of how you can use the tf.data APIs to build efficient and performant input pipelines when you train your model.
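Applying all of the techniques above, the earlier sketch becomes something like the following. Again this is TensorFlow 1.x-era code; the fused ops lived under tf.contrib.data at the time of the talk, exact signatures vary by version, and the placeholders are the same as before:

```python
import tensorflow as tf

NUM_EPOCHS = 90        # placeholder values, as before
BATCH_SIZE = 1024
NUM_CORES = 8          # roughly the number of CPU cores available

def parse_and_augment(record):
    """Placeholder: decode, crop, flip; returns (image, label)."""
    raise NotImplementedError

files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")

# Parallel extract: read several files concurrently.
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=NUM_CORES)

# Fused shuffle+repeat: buffers elements for epoch N+1 while still
# producing elements of epoch N.
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=10000, count=NUM_EPOCHS))

# Fused map+batch: overlaps the map transformation with batch creation.
# The unfused equivalent would be:
#   dataset.map(parse_and_augment, num_parallel_calls=NUM_CORES).batch(BATCH_SIZE)
dataset = dataset.apply(
    tf.contrib.data.map_and_batch(parse_and_augment, batch_size=BATCH_SIZE))

# Prefetch: preprocess batch N+1 while the accelerator trains on batch N.
dataset = dataset.prefetch(1)
```

Each change targets a different resource: parallel reads target storage bandwidth, parallel map targets CPU cores, and prefetch overlaps the CPU with the accelerator.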
So far, we've been talking about training on a single machine with multiple devices. But what if you want to train on multiple machines? You can use the estimator's train_and_evaluate API. train_and_evaluate uses the asynchronous parameter server approach. This API is used widely within Google, and it scales to a large number of machines. Here's a link to the API documentation, where you can learn more about how to use it. We're also excited to be working on a number of new distribution strategies, including a multi-machine mirrored strategy, which allows you to distribute your model across many machines with many devices. We're also working on distribution strategy support for TPUs, and directly in Keras.
In this talk, we covered a lot of the different concepts related to distributed training architectures and APIs. But when you go home today, there are three things to keep in mind when you train your model. First, distribute your training to make it faster; to do this, you want to use the distribution strategy APIs, which are easy to use and fast. Second, input pipeline performance is important, and the tf.data APIs help you build efficient input pipelines. Finally, a few TensorFlow resources. We have the distribution strategy API: you can try using MirroredStrategy to train your model across multiple GPUs. Here's a link to the ResNet50 model garden example, so you can try running an example with distribution strategy support. And finally, a link to the input pipeline performance guide, which has more techniques you can use to build efficient input pipelines.