Priya Gupta is a software engineer in the TensorFlow team at Google. She is working on making it easier to run TensorFlow in a distributed environment. She is passionate about technology and education, and wants machine learning to be accessible to everyone. Prior to joining TensorFlow, she worked at Coursera, and in the Mobile Ads team at Google.
About the talk
TensorFlow’s tf.distribute library helps you scale your model from a single GPU to multiple GPUs and finally to multiple machines using simple APIs that require very few changes to your existing code. Come learn about how you can use tf.distribute to scale your machine learning model on a variety of hardware platforms ranging from commercial cloud platforms to dedicated hardware. We will have a dedicated section talking about tools and tips to get the best scaling for your training in TensorFlow.
Presented by: Priya Gupta, Taylor Robie
I'm Taylor. I'm an engineer on the TF Keras team, and today we'll be talking about writing performant models in TensorFlow 2. Just to give a brief outline of the talk: we're first going to talk about the structure of a training loop in TensorFlow 2, and then we're going to talk about the different APIs that you can use to make your training more performant. The first API that we'll talk about is tf.data. This is the API for writing high-performance input pipelines to avoid various sorts of stalls and make sure that your training always has data as soon as it's ready to consume it. Next we'll talk about tf.function. This is how you can formulate your training step in a way that is efficient, so that the runtime can run it with very minimal overhead. And finally we'll talk about tf.distribute, which is how to target your training to specific hardware and also scale out to more hardware.
In order to demonstrate all of these APIs, we're going to be walking through a case study, starting with the most naive implementation and then progressively adding more performant APIs and looking at how that helps our training. So we're going to be training a cat classifier, because we're an internet company and that's what we do, and we're going to be training it on MobileNetV2. This task was largely chosen arbitrarily, just as a representative task, so these lessons should be broadly applicable to your workloads as well. There's a notebook, and I'll put the link back up at the end of the slides, so you're encouraged to take a look afterward and compare back to the talk.
A brief overview of training loops in TensorFlow 2: we're going to be writing a custom training loop. The reason is that we want to look at the mechanics of what the system is actually doing and how we can make it performant. If you're used to using a higher-level API like Keras model.fit, these lessons are still broadly applicable; it's just that some of this will be done automatically for you. So if we want to show why these ideas are important, it's useful to really dig into the internals of the training loop. For the model, we've chosen to use Keras Applications MobileNetV2, which is just a canned MobileNetV2 implementation. We're training a classifier, so we're going to be using sigmoid cross entropy with logits, which is a numerically stable cross-entropy function. And then finally we're going to be using the Keras SGD optimizer.
So if we call our model on our features (this is an image classification task, so our features are an image), we'll get logits, which is the predicted class. If we then apply a loss function comparing the labels and logits, this gives us a scalar loss, which we can then use to optimize and update our model. We want to call all of this under a tf.GradientTape. What the gradient tape does is watch the computations as they're performed, maintain the metadata, and keep the activations alive. This is what allows us to take gradients, which we can then pass to the optimizer to update our model weights. And finally, if we wrap all of this into a step and then iterate over our data in mini-batches, applying this step to each mini-batch, this comprises the general structure of our training loop.
On the data side, we're going to start out with just a strawman Python generator to show the different parts of the data pipeline, and then we'll look at its performance a little bit later. So what do we need to do? Well, first we want to shuffle the data so that we see a different order during each epoch. We'll need to load all of the images from disk into memory so they can be consumed by the training process. We'll want to resize the images from their native format to the format that the model expects, and then we'll do some task-specific augmentation. In this case we'll randomly flip the images and add pixel-level noise to make our training a little bit more robust. And finally we'll need to batch these. The generator that I've shown is producing examples one at a time, but we'll need to collect them into mini-batches, and the key operation for that batching is a concatenation.
It's worth noting that the different parts of the data pipeline will stress different parts of the system. Loading from disk is an I/O-bound task, and we'll generally want to consume this I/O as fast as possible so that we're not constantly waiting for images to arrive from disk one at a time. Secondly, the augmentation tends to be rather CPU-intensive, because we're doing various sorts of math in our augmentation. And finally, batching tends to be a somewhat memory-intensive task, because we're copying examples from their original location into the memory buffer of our mini-batch.
So now we have our scaffolding. Let's just go ahead and run this. We're going to be running on an NVIDIA V100 GPU. By default, TensorFlow will attempt to place ops on the optimal device, and because we're doing large matrix-multiply sorts of operations, it will place them on the GPU. However, we find that our training at the start is pretty lackluster: we're under a hundred images per second, and if we actually look at our device utilization, it's well under 20%. So what's going on here?
Well, in order to determine that, we can do some profiling, and TensorFlow 2 and TensorBoard make this quite easy to do. You might already be familiar with TensorBoard. This is the standard way of monitoring training as it runs, so you might be familiar with the scalars tab, which shows things like losses and metrics as you're training. There is also the profiler tab. What this does is look at each op as it runs in TensorFlow and show you a timeline of how your training is progressing, and we're going to look in a little bit more detail at how to read and interpret these timelines.
Before we do, however, I'll talk about how to enable it. It's quite straightforward. You simply turn on the profiler, which signals to the runtime that it should keep records of ops as it runs them. You write your program as normal, and then you simply export this trace, which puts it in a format that TensorBoard can consume and display for you. So this is the timeline. Moving left to right we have time, and each row is a different thread, and you can see that they're annotated with device.
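The training step and profiler workflow described above can be sketched roughly as follows. This is a minimal sketch rather than the notebook's code: a tiny `Dense` model stands in for MobileNetV2, `naive_generator` is a hypothetical strawman that yields random data, and the profiler calls use the public `tf.profiler.experimental` API found in recent TensorFlow releases, which may differ from what the slides showed.

```python
import tempfile

import numpy as np
import tensorflow as tf

# Stand-in for tf.keras.applications.MobileNetV2: a tiny model keeps the sketch fast.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)


def train_step(features, labels):
    # The tape records the forward computation so gradients can be taken afterwards.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


def naive_generator(num_batches, batch_size=4):
    # Hypothetical strawman input pipeline: one random mini-batch at a time.
    rng = np.random.default_rng(0)
    for _ in range(num_batches):
        features = rng.random((batch_size, 8), dtype=np.float32)
        labels = (rng.random((batch_size, 1)) > 0.5).astype(np.float32)
        yield features, labels


logdir = tempfile.mkdtemp()              # assumed location for the trace
tf.profiler.experimental.start(logdir)   # turn the profiler on
losses = [float(train_step(x, y)) for x, y in naive_generator(3)]
tf.profiler.experimental.stop()          # export the trace for TensorBoard
```

Pointing TensorBoard at `logdir` should then show the captured steps in its profiler view, provided the profiler plugin is installed.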
So the top row is the GPU and the bottom ones are the various CPU threads. In this case, we're looking at three different training steps. Each vertical line is an individual op that's been run by TensorFlow. You can see that the ops are scheduled from the CPU, and then they appear on the GPU stream, and the GPU actually executes the ops, which is where the heavy lifting is done. But you can also see that we have these large gaps in between, where no ops are running and our system is essentially stalled. This is because, for each step, we have to wait for our naive generator to produce the next batch of data. This is obviously quite inefficient, so Priya is going to tell you how we can improve that.
As Taylor just showed us, writing your own input pipeline in Python to read data and transform it can be pretty inefficient. TensorFlow provides the tf.data API to allow you to easily build performant and scalable input pipelines. You can think of a tf.data input pipeline as an ETL process. The first stage is the extract stage, where we read the data from, say, network storage or from your local disk, and potentially parse the file format. The second stage is the transform stage, which is where you take the input file data and transform it into a form that's amenable to your ML computation. These could be image transformations that are specific to your ML task, or they could be generic transformations like shuffling or batching that apply to a lot of ML tasks. Once you've transformed this data, you can load it onto your accelerator for your training.
Let's take a look at some code to see how you can use tf.data to write the same input pipeline that Taylor showed you before, but in a much more efficient way. There are a number of steps here, and I'll go over them one by one. The first is that we create a simple dataset consisting of all the file names in our input. In this case, we have a number of image files, and the path glob specifies how to find these file names. The second thing we do is shuffle this list of file names, because in training you typically want to shuffle the input data that's coming in. Then we apply a transformation called the map transformation, and we provide it this get_bytes_and_label custom function. What this function does is read the files one by one using the tf.io.read_file API, use the file name path to compute the label, and return both of these.
These three steps comprise the extract phase that I talked about before: reading the files and parsing the file data. There are a number of different APIs that you can use for other situations, which we have listed here. The most notable one is the TFRecordDataset API, which you would use if your data is in the TFRecord file format, and this is potentially the most performant format to use with tf.data. If you have your input in memory, in a NumPy array, let's say, you can also use the from_tensor_slices API to convert it into a tf.data dataset and do subsequent transformations like this.
The next thing we do is another map transformation, to now take this raw image data, convert it, and do some processing to make it amenable to our training task. So we have this process_image function here, which we provide to the map transformation. If you take a look, it could look something like this: we do some decoding of the image data, and then we apply some image processing transformations such as resizing, flipping, normalization, and so on. The key things to note here are, first, that these transformations are actually very similar to, and correspond one-to-one with, what you saw before in the Python version, but the big difference is that now we're using TensorFlow ops to do the transformations. The second thing is that tf.data will take the custom function that you specified to the map transformation, trace it as a TensorFlow graph, and run it in the C++ runtime instead of Python. This is what allows it to be much more efficient than Python.
And finally, we use the batch transformation to batch the elements of your input into mini-batches, which is a very common practice for training efficiency in ML. So before I hand it off to Taylor to talk about how using tf.data can improve the performance in our MobileNet example, I want to walk through a few more advanced performance optimization tips for tf.data. I'll go through them quickly, but you can also read about them in much more detail on the page that's listed here.
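The extract, transform, and batch steps just described might be put together as in the sketch below. The function names `get_bytes_and_label` and `process_image` follow the talk, but their bodies are assumptions, as are the directory layout (the class name is taken from the parent directory), the image sizes, and the temporary JPEG files written only to keep the example self-contained.

```python
import os
import tempfile

import tensorflow as tf

# Write a few tiny JPEGs so the sketch is self-contained (hypothetical layout:
# <root>/<class name>/<file>.jpg, with the class encoded in the directory name).
root = tempfile.mkdtemp()
for class_name in ("cat", "not_cat"):
    os.makedirs(os.path.join(root, class_name))
    for i in range(3):
        pixels = tf.cast(
            tf.random.uniform([48, 64, 3], maxval=256, dtype=tf.int32), tf.uint8)
        tf.io.write_file(os.path.join(root, class_name, f"{i}.jpg"),
                         tf.io.encode_jpeg(pixels))


def get_bytes_and_label(path):
    # Extract: read the raw file and derive the label from the parent directory.
    parts = tf.strings.split(path, os.sep)
    label = tf.cast(parts[-2] == "cat", tf.float32)
    return tf.io.read_file(path), label


def process_image(image_bytes, label):
    # Transform: decode, resize, augment and normalize using TensorFlow ops.
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, [32, 32])
    image = tf.image.random_flip_left_right(image)
    return image / 255.0, label


dataset = tf.data.Dataset.list_files(os.path.join(root, "*", "*.jpg"), shuffle=False)
dataset = dataset.shuffle(buffer_size=6)     # shuffle the list of file names
dataset = dataset.map(get_bytes_and_label)   # extract
dataset = dataset.map(process_image)         # transform
dataset = dataset.batch(4)                   # collect into mini-batches

images, labels = next(iter(dataset))
```

Because both mapped functions use only TensorFlow ops, tf.data can trace them into a graph instead of calling back into Python for every element.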
The first optimization that I want to talk about is pipelining, and this is conceptually very similar to any other software pipelining that you might be aware of. The idea here is that while you're training on a single batch of data on your accelerator, we want to use the CPU resources at the same time to process and prepare the next batch of data. What this does is that when the next training step starts, we don't have to wait for the next batch of data to be prepared. It will automatically be there, and this can reduce the overall training time significantly. Pipelining in tf.data is very simple: you can use the prefetch transformation as shown here.
The second optimization that I want to talk about is parallelizing the transformation stage. By default, the map transformation will apply the custom function that you provide to each element of your input dataset in sequence. But if there's no dependency between these elements, there's no reason to do this in sequence, right? So you can parallelize this by passing the num_parallel_calls argument to the map transformation, and this indicates to the tf.data runtime that it should run these map operations on the elements of your dataset in parallel.
The third optimization that I want to talk about is parallelizing the extraction stage. Similar to how transforming your elements in sequence can be slow, reading files one by one can be slow as well, so of course you want to parallelize it. In this example, since we're using a map transformation to read our files, the way you do it is actually very similar to what we just did: you add a num_parallel_calls argument to the map function that reads your files. Note that if you're using one of the built-in file readers, such as TFRecordDataset, you can also provide very similar arguments to that in order to parallelize the file reading there.
If you've been paying close attention, you'll notice that we have these magic numbers X, Y, and Z on the slides, and you might be wondering how you determine the optimal values of these. In reality, it's actually not very straightforward to compute the optimal values of these parameters, because if you set them too low, you might not be using enough parallelism in your system, and if you set them too high, it might lead to contention and actually have the opposite effect of what you want. Fortunately, tf.data makes it really easy for you to specify these. Instead of specifying specific values for these X, Y, Z arguments, you can simply use this constant, tf.data.experimental.AUTOTUNE. What this does is indicate to the tf.data runtime that it should do the autotuning for you and determine the optimal values for these arguments based on your workload, your environment, your setup, and so on. That's all for tf.data. I'm now going to hand it off to Taylor to talk about what kind of performance benefits you can see.
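The three optimizations above (prefetching, a parallel map, and autotuned parallelism) can be illustrated with a small in-memory pipeline. This is a sketch under assumptions: the data shapes and the `augment` function are made up for illustration, and the constant is spelled `tf.data.AUTOTUNE` as in recent releases (the talk refers to the older `tf.data.experimental.AUTOTUNE` spelling).

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE in older releases

# In-memory stand-in data (hypothetical shapes), wrapped with from_tensor_slices.
images = np.random.rand(32, 16, 16, 3).astype("float32")
labels = np.random.randint(0, 2, size=(32,)).astype("float32")


def augment(image, label):
    # Elements are independent, so this function is safe to run in parallel.
    return tf.image.random_flip_left_right(image), label


dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=32)
    .map(augment, num_parallel_calls=AUTOTUNE)  # parallel transformation stage
    .batch(8)
    .prefetch(AUTOTUNE)  # prepare the next batch while the current one trains
)

batch_shapes = [tuple(batch_images.shape) for batch_images, _ in dataset]
```

When files are read with a map function, the same `num_parallel_calls` argument applies to that map as well; `tf.data.TFRecordDataset` takes a comparable `num_parallel_reads` argument.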
So on the right here, we can see the timeline before and after we add tf.data. Before, we had this long stall in between our training steps, where we were waiting for data. Once we've used tf.data, you'll note two things about the timeline afterward. The first is that the training step begins immediately after the previous one, and that's because tf.data was actually preparing the upcoming batch while the previous training step was happening. The second is that before, it actually took longer than the time of a batch in order to prepare the next batch, whereas now we see no large stalls in the timeline. The reason is that tf.data, because of its native parallelism, can now produce batches much more quickly than the training can consume them. And you can see this manifest in our throughput, with more than a 2x improvement.
But sometimes there are even more stalls. If we zoom in on the timeline of one of our training steps, we'll see that there are a number of these very small gaps, and these are launch overhead. So if we look further at different portions of the GPU stream: near the end, this is what a healthy profile looks like.
You have one op that runs and then finishes, and immediately the next op can begin, and this results in an efficient and saturated accelerator. On the other hand, in some of the earlier parts of the model, an op is scheduled on the GPU, the GPU immediately chews through it, and then it simply waits idle for Python to enqueue the next op. This leads to very poor accelerator utilization and efficiency. So what can we do? Well, if we go back to our training step, we can simply add the tf.function decorator. What this will do is trace the entire training step into a TF graph, which can then be run very efficiently from the TF runtime, and that's the only change needed. Now if we look at our timeline, we see that all of the scheduling is done in this single, much faster op on the CPU, and then the work appears on the GPU as before. It also allows us to use the rich library of graph optimizations that were developed in TensorFlow 1. And you can see this is again almost another factor-of-two improvement, and it's pretty obvious in the timeline on the right why. Whereas before tf.function, when you're running everything eagerly, you have all of these little gaps waiting for ops, now, once it's compiled into a graph and launched from the C++ runtime, it's able to do a much better job keeping up with the GPU.
The next optimization that I'm going to talk about is XLA, which stands for Accelerated Linear Algebra. To understand how XLA works, we have just this simple example of a graph with some skip connections. What XLA will do is cluster that graph into subgraphs and then compile each entire subgraph into a very efficient fused kernel. There are a number of performance gains from using these fused kernels. The first is that you get much more efficient memory access, because the kernel can use data while it's hot in the cache, as opposed to pushing it all the way down the memory hierarchy and then bringing it all the way back. It also reduces launch overhead (this is the C++ executor launch overhead) by running fewer ops for the same math. And finally, XLA is heavily optimized to target hardware, so it does things like use efficient hardware-specific implementations, specialize on shapes, and choose layouts so that the hardware is able to have very efficient access patterns.
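To make the tf.function step concrete, here is a minimal sketch of a training step traced into a graph. It assumes a hand-written logistic regression in place of the talk's MobileNetV2 model, and `trace_log` is a hypothetical helper list used only to show that the Python body runs once, during tracing, rather than on every call.

```python
import math

import tensorflow as tf

LEARNING_RATE = 0.1
weights = tf.Variable(tf.zeros([4, 1]))  # created outside the traced function
trace_log = []


@tf.function
def train_step(features, labels):
    # Python side effect: executed while tracing, not on every call.
    trace_log.append("traced")
    with tf.GradientTape() as tape:
        logits = tf.matmul(features, weights)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    grad = tape.gradient(loss, weights)
    weights.assign_sub(LEARNING_RATE * grad)  # plain SGD update
    return loss


features = tf.random.normal([8, 4])
labels = tf.cast(tf.random.uniform([8, 1]) > 0.5, tf.float32)

# Three calls with identical input shapes and dtypes reuse the same graph.
losses = [float(train_step(features, labels)) for _ in range(3)]
```

After the first call, each step is dispatched as a single graph execution, so Python no longer sits between individual ops.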
And XLA is quite straightforward to enable. You simply use this tf.config.optimizer.set_jit flag, and this will cause every tf.function that runs to attempt to compile and run with XLA. You can see that in our example it's a very stark improvement: about a 2.5x improvement in throughput. The one caveat with XLA is that a lot of the optimizations it uses are based on specializing on shape, so XLA needs to recompile every time it sees new shapes. So if you have an extremely dynamic model, for instance one where the shapes are different each batch, you might actually wind up spending more time on the XLA compile than you gain back from the more efficient computation. So you should definitely try out XLA, but if you see a performance regression rather than performance gains, it's likely that that's the reason.
Next we're going to talk about mixed precision. If we want to go even faster, then we can give up some of our numeric stability in order to obtain faster training performance. So here I have the IEEE float32, which is the default numeric representation in TensorFlow, but there are also two half-precision formats that are relevant. The first is bfloat16, where we keep all of the exponent and simply chop off 16 bits of mantissa, and the other is float16, where we give up some exponent in exchange for keeping a little bit more mantissa. What's important is that they actually have native hardware support. TPUs have hardware support for very efficient bfloat16 operations, and GPUs have support for very efficient float16 operations. So if we can formulate our computation in these reduced-precision formats, we can potentially get very large speedups from the hardware.
In order to enable this, we need to do a couple of things. First, we need to choose a loss scale. So what is a loss scale? A loss scale is a constant that's inserted into the computation, which doesn't change the mathematics of the computation, but it does change the numerics. And so this is a knob where the runtime can adjust the computation to keep it in a numerically stable range. The next thing we'll want to do is set a mixed precision policy, and this will tell Keras that it should cast tensors as they flow through the model in order to make sure that the computation is actually happening in the correct floating-point representation. In a custom training loop, we'll want to wrap our optimizer in this LossScaleOptimizer, and this is the hook by which the loss scale is inserted. So as training happens, TensorFlow will dynamically adjust this loss scale to balance protecting gradients from fp16 underflow while still preventing fp16 overflow. If you're using the model.fit workflow, this will be done for you.
And finally, we generally need to increase our batch size when doing this mixed precision training. Reduced precision makes computing a single example much less expensive, so if we just turn on mixed precision, we can go from a saturated accelerator in float32 to an underutilized accelerator in float16. By increasing the batch size, we can go back to filling all of the hardware registers. And we can see this in our example: if we just turn on float16, there's actually no improvement in performance, but if we then increase our batch size, we see a very substantial improvement in performance. It is worth noting that because we've both reduced the numerical precision and changed the batch size, you may need to retune your hyperparameters.
If we look at what the remaining bits of performance are: about 60% of what's left is actually the copy from the host CPU to the GPU. Now, in a little bit, Priya is going to talk about distribution, and one of the things that the distribution-aware code will do is automatically pipeline this prefetch, so you'll actually get that 60% for free. And then finally you have hand tuning: you can give up some of the numeric stability, you can mess with thread pools, you can manually try to optimize layouts. This is included largely for completeness, in case you're curious what real low-level hand tuning looks like. But for most cases, given that we can get the vast majority of the performance with just simple, idiomatic, easy-to-use APIs, this very fine hand tuning is not recommended. But once we've actually saturated a single device, we need to do something else to get another order-of-magnitude improvement, and so Priya is going to talk to you about that.
All right. So Taylor talked about a number of different things you can do to get the maximum performance out of a single machine with a single GPU. But if you want your training to go even faster, you need to start scaling out. So maybe you want to add more GPUs to a machine, or perhaps you want to train on multiple machines in a cluster, or maybe you want to use specialized hardware such as Cloud TPUs. TensorFlow provides the distribution strategy API to allow you to do just that.
We built this API with three key goals in mind. The first goal is ease of use: we want users to be able to use the distribution strategy API with very few changes to their code. The second goal is to give great performance out of the box. We don't want users to have to change their training code or tune a lot of knobs to get the maximum efficiency out of their hardware resources. And finally, we want this API to work in a variety of different situations: for instance, if you want to scale out to multiple GPUs or multiple machines or TPUs, or if you want different distributed training architectures such as synchronous or asynchronous training, or if you're using different types of APIs. So maybe you're using the high-level Keras APIs like model.fit, or you have a custom training loop, as in our example earlier. We want distribution strategy to work in all of these potential cases.
There are a number of different ways in which you can use this API, and we've listed them here in order of increasing complexity. The first one is if you're using the Keras high-level API for your training and you want to distribute your training in that setup. The second use case is what we've been talking about in this talk so far: you have a custom training loop and you want to scale out your training. The third use case is that maybe you don't have a specific training program, but you're writing a custom layer or a custom library and you want to make it distribution-aware. And finally, maybe you're experimenting with a new distributed training architecture and you want to create a new strategy. In this talk, I'm only going to talk about the first two cases.
Let's begin. The first use case we'll talk about is if you have a single machine with multiple GPUs and you want to scale up your training to this situation. For this setup, we provide MirroredStrategy. MirroredStrategy implements synchronous training across multiple GPUs on a single machine. The entire computation of the model will be replicated on each GPU. All the variables of your model will be replicated on each GPU as well, and they will be kept in sync using all-reduce.
Let me go step by step through what this synchronous training looks like. The gray boxes on these pictures are not really visible, but in our example we have two devices, or two GPUs, devices 0 and 1, and we have a very simple model with two layers, layers A and B. Each layer has a single variable. As you can see, the variables are mirrored, or replicated, on these two devices. In our forward pass, we'll give a different slice of our input data to each of these devices, and they will compute the logits using the local copy of the variables on these devices. In the backward pass, each device will then compute the gradients, again using the local copy. Once each device has computed the gradients, they'll communicate with each other to aggregate these gradients. And this is where all-reduce, which I mentioned before, comes in. All-reduce is a special class of algorithms that can be used to efficiently aggregate tensors such as gradients across different devices, and it can reduce the overhead of synchronization by quite a bit. There are a number of different such algorithms available, and some hardware vendors, such as NVIDIA, also provide specialized implementations of all-reduce for their hardware, such as NCCL. Once the gradients have been aggregated, the aggregated result will be available on each device, and each device can update its local copy of the variables using these aggregated tensors. In this way, the devices are kept in sync, and the next forward pass doesn't begin until all the variables have been updated.
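The synchronous update just described can be illustrated without any GPUs. The NumPy sketch below simulates two replicas with mirrored weights, a linear model with mean squared error standing in for the real network, and a hypothetical `all_reduce_mean` helper that plays the role of all-reduce by averaging. It is an illustration of the idea, not how MirroredStrategy or NCCL is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 3))
labels = rng.normal(size=(8,))
NUM_REPLICAS = 2
LEARNING_RATE = 0.1

# Mirrored variables: every replica starts with an identical copy of the weights.
replica_weights = [np.zeros(3) for _ in range(NUM_REPLICAS)]


def local_gradient(weights, x, y):
    # Gradient of mean squared error for a linear model on this replica's slice.
    return 2.0 * x.T @ (x @ weights - y) / len(y)


def all_reduce_mean(tensors):
    # Stand-in for all-reduce: every replica receives the same aggregated value.
    mean = sum(tensors) / len(tensors)
    return [mean.copy() for _ in tensors]


# Forward/backward pass: each replica sees a different slice of the global batch.
x_slices = np.split(features, NUM_REPLICAS)
y_slices = np.split(labels, NUM_REPLICAS)
local_grads = [local_gradient(w, x, y)
               for w, x, y in zip(replica_weights, x_slices, y_slices)]

# Aggregate, then apply the same update on every replica.
reduced = all_reduce_mean(local_grads)
replica_weights = [w - LEARNING_RATE * g for w, g in zip(replica_weights, reduced)]

# Reference: the gradient a single device would compute on the whole batch.
full_batch_grad = local_gradient(np.zeros(3), features, labels)
```

With equal-sized slices, the averaged per-replica gradient equals the full-batch gradient, which is why the replicas stay in sync and the result matches single-device training on the global batch.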
So now let's take a look at some code to see how you can use MirroredStrategy to scale up your training. As I mentioned, we'll talk about two types of use cases. The first one is if you're using the Keras high-level API, and then we'll come back to the custom training loop example after this. The code here is some skeleton code to train the same MobileNetV2 model, but this time using the model.compile and fit API in Keras. In order to change this code to use MirroredStrategy, all we need to do is add these two lines of code. The first thing you do is create a MirroredStrategy object. And the second thing is you move the rest of your training code inside the scope of the strategy. Putting things inside the scope lets the strategy take control of things like variable creation, so any variables that you create under the scope of the strategy will now be mirrored variables. You don't need to make any other changes to your code in this case, because we've already made components of TensorFlow distribution-aware. So for instance, in this case, we have the optimizer, which is distribution-aware, as well as compile and fit.
The case we just saw was the simplest way in which you can create a MirroredStrategy. You can also customize it. By default, it will use all the GPUs available on your machine for training, but if you want to use only specific ones, you can specify them using the devices argument. You can also customize which all-reduce algorithm you want to use by using the cross_device_ops argument.
So we've seen how to use MirroredStrategy when using the high-level Keras model.fit API. Now let's go back to the custom training loop example from before and see how you can modify that to use MirroredStrategy. There's a little more code in this example, because when you have a custom training loop, you have more control over what exactly you want to distribute, and so you need to do a little bit more work to distribute it as well. So here is the skeleton of the custom training loop from before. We have the model, the loss function, and the optimizer, you have your training step, and then you have your outer loop, which iterates over your data and calls the training step.
The first thing you need to do is the same as before: you create the MirroredStrategy object, and you move the creation of the model, the optimizer, and so on inside the scope of the strategy. The purpose of this is the same as before. But as I mentioned, this is not sufficient in this case and you need to do a few more things, so let's go over each of them one by one. The first thing you need to do is distribute your data. Typically, if you're using tf.data datasets as your input, this is straightforward. All you need to do is call strategy.experimental_distribute_dataset on your dataset, and this returns a distributed dataset, which you can then iterate over in a very similar manner as before. The second thing you need to do is scale your loss by the global batch size. This is very important so that the convergence characteristics of your model do not change, and we provide a helper method in the tf.nn library to do so. And the third thing you need to do is specify which specific computation you want to replicate. In this case, we want to replicate our training step on each replica. So you can use the strategy.experimental_run_v2 API and provide the training step function as well as your distributed input, and you wrap this whole thing in another tf.function, because we want all of this to run as a TF graph as well.
Those are all the code changes you need to make in order to take the custom training loop from before and run it in a distributed fashion using the distribution strategy API. So let's go back and take a look at what kind of scaling you can see. Just out of the box, adding this MirroredStrategy to the MobileNetV2 example from before, we were able to get 80% scaling when going from one GPU to eight GPUs on a single machine. It's possible to get even more scaling out of this by doing some manual optimizations, which we won't be going into today.
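The custom training loop changes listed above (strategy scope, distributed dataset, loss scaled by the global batch size, and a replicated step) might look like the sketch below. It rests on several assumptions: a tiny `Dense` model and random data stand in for MobileNetV2 and real images, the loss helper is taken to be `tf.nn.compute_average_loss`, and the replicated call uses `strategy.run`, which is the name recent TensorFlow releases use for what the talk calls `experimental_run_v2`. On a machine without GPUs the strategy simply runs a single replica.

```python
import numpy as np
import tensorflow as tf

GLOBAL_BATCH_SIZE = 8

strategy = tf.distribute.MirroredStrategy()  # all visible GPUs, or CPU if none

with strategy.scope():
    # Variables created under the scope become mirrored variables.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

features = np.random.rand(32, 4).astype("float32")
labels = (np.random.rand(32, 1) > 0.5).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)


def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        per_example_loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits), axis=-1)
        # Divide by the global batch size, not the per-replica batch size.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


@tf.function
def distributed_train_step(inputs):
    # Replicate the step on every device, then sum the per-replica losses.
    per_replica_loss = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)


losses = [float(distributed_train_step(batch)) for batch in dist_dataset]
```

Summing the per-replica losses after dividing each by the global batch size gives the same value a single device would report for the whole batch, which is the reason for scaling by the global rather than the per-replica batch size.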
Give another example of the scaling run multi-chip you training with 4 retina 50, which is a very popular image classification Benchmark. And in this sample, we also used as an xoa techniques that Taylor talked about before and here you can see going from 1 to a Jeep use variables to get 85% scaling and this example was using the carousel model. Fetty WAP. Custom painting with example, if you're interested, you can look at the link in the bottom and you can try out this model yourself. Another example is using the Transformer language model to show the does not just 4 images were able to
scale of other types of models does and in this case were able to get more than 90% scaling when running from 128gb. So so far we've been talking about scaling up to multiple gpus in a single machine, but most likely you would want to scale even further out to multiple machines. Maybe with multiple gpus are perhaps just with CPUs. For these use cases you provide the multi worker mirror strategy. As the name suggests. This is very similar to the mirror strategies. Have you been talking about it? Also Implement controls training but this time across all the
different machines in your pastor. In order to reduce. Its uses a new type of up intensive local Collective Arts collective in the graph which can determine the best holidays to use face in a number of different factors such as the network topology the type of communication available between the different machines as well as if you have a lot of small cancers that you want to Advocate hit me back them up into a single tensor in order to reduce the load on the network. So, how can you use multi work Amir
strategy? It's actually very similar to MirroredStrategy. You create a strategy object like so, and the rest of your training code actually remains the same. I've omitted it here, but once you have the strategy object, you can put the rest of your code inside the strategy scope and you're good to go. One more thing you need to do in the multi-worker case, however, is to give us information about your cluster. One way to do that for MultiWorkerMirroredStrategy is using the TF_CONFIG environment variable. You might be familiar with this environment variable if you've used distributed training in the past using
Estimator in TensorFlow 1. TF_CONFIG has two pieces. The first is the cluster, which gives us information about your entire cluster; here we have two different workers at these hosts and ports. The second piece is information about the specific worker; here the task says that this is worker 1. Once you provide the TF_CONFIG, the task part will be different on each worker, but the training code can remain the same, and distribution strategy will read the TF_CONFIG environment variable and figure out how to communicate with the different workers in your cluster. So far we've talked about multiple
machines and multiple GPUs. What about TPUs? You've probably heard about TPUs at this conference or somewhere else before. TPUs are custom hardware built by Google to accelerate machine learning workloads. You can use them through Cloud TPUs, or you can even try them out in Colab. Distribution strategy provides TPUStrategy for you to be able to scale up your training to TPUs. It is similar to MirroredStrategy in that it implements synchronous training, and it uses the cross-replica sum API to do the all-reduce across the TPU cores. You can use this API to scale your training to a single TPU or a slice
of a pod, or a full pod as well. If you want to try out TPUStrategy, you can try it with the TensorFlow nightly releases, or you can wait for the 2.1 stable release. Let's look at the code to see how you can use TPUStrategy. The first thing you need to do is provide information about the TPU cluster. The way you do that is to create a TPUClusterResolver and give it the name or address of your TPU. Once you have a cluster resolver, you can use the experimental_connect_to_cluster API to connect to this cluster, and also initialize the TPU
system. Once you've done these three things, you can create the strategy object and pass the cluster resolver object to it. Once you have the strategy object, the rest of the code remains the same: you create the strategy scope and put your training code inside it, so I won't be going into that here. Looking at this code, you can see that distribution strategy makes it really easy to switch from training on multiple GPUs or multiple machines to specialized hardware like TPUs. That's all I was going to talk about for distribution strategy. To recap, today we talked about a few
different things. We talked about tf.data to build simple and performant input pipelines. We talked about how you can use tf.function to run your training as a TensorFlow graph. We talked about how you can use XLA and mixed precision to improve your training speed even further. And finally, we talked about the distribution strategy API to scale out your training to more hardware. We have a number of resources on our website, including guides and tutorials, and we encourage you to check those out. If you have any questions or want to report bugs, please reach out to
us on GitHub. I will also be available in the expo hall after the talk if you have more questions. And finally, the notebook that we've been using in the talk is listed here, so if you want to take a picture, you can check it out later. That's all. Thank you so much for listening.
We have some time, so we can take some questions.
Thanks for the nice presentation. My question is regarding inference. You mentioned how we can use TensorBoard for profiling to see how
our CPUs and GPUs are utilized. Can we do the same to see how well our inference is working?
If you're just running the forward pass, it should show that as well. And how are you doing inference? Are you doing it using the Python APIs?
... and I want to use that in TensorBoard.
So you're not really using any of the Python API, right?
Okay. Regarding mixed precision, will it help in inference as well, and how do we enable it?
It can, potentially.
Typically, for inference, the integer low-precision formats tend to be more important than the floating-point ones. If you really care about very fast inference with low precision, I would suggest that you check out TF Lite; they have a lot of quantization support.
Thank you.
Hello, I just have one question about how to save your model when you are training with a distribution strategy, because your code is replicated on all the nodes. What is the way to do it?
You can use the TF SavedModel APIs. What they do is not save the replicated model; they save a single copy of the model. So then it's like any other saved model. If you load it inside another strategy scope, it will get replicated at that point, or you can use it for inference like any other saved model.
My experience was that all the nodes tried to save the model at the same time, and there was an issue, a race condition with the placement, and so you got an error.
Yes. It did.
But that could be a bug.
Thanks for the talk. I have a quick question about mixed precision and XLA. Will the API automatically use hardware-related features? For example, with mixed precision, if you have a Volta machine, will it be able to use the features specific to that machine?
So XLA, or sorry, mixed precision inserts a bunch of casts, and then XLA will go along and actually optimize a lot of these away. It will try and make the
layout more amenable to mixed precision. XLA tends to talk to the hardware at a very low level.
So from the user's point of view it is transparent: if I run the same code on different hardware, it will automatically try to use the hardware's capabilities. Thank you.
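As a supplement to the multi-worker part of the talk, here is a minimal sketch of the two-piece TF_CONFIG variable described there: a cluster section listing every worker, and a task section identifying the current worker. The helper name `make_tf_config` and the host addresses are hypothetical; the transcript does not record the actual hosts and ports shown on the slide.

```python
import json
import os

# Hypothetical worker addresses (host:port); placeholders for the two
# workers mentioned in the talk.
WORKERS = ["host1.example.com:2222", "host2.example.com:2222"]


def make_tf_config(workers, task_index):
    """Build the TF_CONFIG JSON string for one worker.

    "cluster" is identical on every worker; "task" differs per worker.
    """
    if not 0 <= task_index < len(workers):
        raise ValueError("task_index must refer to an entry in workers")
    return json.dumps({
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": task_index},
    })


# On the second machine, identify this process as worker 1. This has to be
# set before the strategy object is created so that it can be read.
os.environ["TF_CONFIG"] = make_tf_config(WORKERS, task_index=1)
print(os.environ["TF_CONFIG"])
```

Only the task index changes from machine to machine, which matches the point in the talk that the training code itself can stay the same on every worker.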