Zak Stone is the product manager for TensorFlow and Cloud TPUs (Tensor Processing Units) on the Google Brain team. He is interested in making hardware acceleration for machine learning universally accessible and useful. He also enjoys interacting with TensorFlow's vibrant open-source community. Prior to joining Google, Zak earned a PhD in Computer Vision and founded a mobile-focused deep learning startup that was acquired by Apple. While at Apple, Zak contributed to the on-device face identification technology.
About the talk
Cloud Tensor Processing Units (TPUs) enable machine learning engineers and researchers to accelerate TensorFlow workloads with Google-designed supercomputers on Google Cloud Platform. This talk will include the latest Cloud TPU performance numbers and survey the many different ways you can use a Cloud TPU today: for image classification, object detection, machine translation, language modeling, sentiment analysis, speech recognition, and more. You'll also get a sneak peek at the road ahead.
Hello everyone, and welcome to Effective Machine Learning with Cloud TPUs. I'm delighted to see all of you here, and I'd like to send a special welcome to everyone on the livestream, or anyone watching this as a recording later. I'm Zak Stone, the product manager for TensorFlow and Cloud TPUs on the Google Brain team, and this talk is about supercomputers for machine learning that are built around Google's custom chips, called Tensor Processing Units, or TPUs. We've had three generations of Tensor Processing Units, including the one on the bottom that Sundar announced in the keynote earlier this morning.

Before I get into the details, I just want to set the stage for a moment and share a little bit of context about why this matters. There's been a tremendous amount of machine learning progress over the past several years, as you've already heard earlier today. If you look at the number of research publications on arXiv, which is an open site for sharing papers, the number of papers published is growing faster than Moore's law, and we're up to something like 50 new machine learning papers every day. This is a tremendous rate of innovation. A lot of it is happening in the open, which is fantastic, and it's driving a lot of new applications and real progress across a wide range of fields. As you can see here, in just one domain, computer vision, there's been a tremendous increase in accuracy on the ImageNet benchmark challenge just over the past few years, rising from the original breakthrough results with AlexNet all the way to some of the more recent machine learning models. But these accuracy trends come at a cost that I'll tell you about in a minute. First, I want to help you understand what those sorts of applications are about. You've heard a little bit about diabetic retinopathy, and also these new signals of heart health that you can get in a completely passive way, just by taking a picture of the back of the eye. This is the kind of real-world application that's going to help improve people's lives, and it's only possible because of these cutting-edge, accurate, powerful machine learning models. You've also probably heard about tracking illegal logging in the rainforest by putting recycled cell phones in trees, where they can listen for the sounds of chainsaws or other activity and help people protect rainforests in the central Amazon.
I know one thing that's really changed, even since I was in graduate school, is this unification across many different application domains. Before, it might have been that the folks in computer vision had one set of techniques, the folks in speech recognition had a different set of techniques, and the folks in machine translation were using yet a third set of techniques to go from one language to another. But what we've been seeing over the past few years is a consolidation, especially in deep learning, around neural networks that share similar components even if the details of their structure differ, and these neural networks now hit state-of-the-art results across all these different tasks. Often these results are accuracies that we don't know how to achieve any other way. As I mentioned earlier, this has come at a bit of a cost, which is that as these models get more accurate, they tend to be larger and trained on larger datasets, and that means you need more computation to train the model on the dataset, and then eventually to run it. What I'm showing here is a plot of these image recognition models on ImageNet, this time with accuracy on the vertical axis and the number of operations, which is a rough metric of computational cost, on the horizontal axis. As you can see, as you get to higher and higher accuracies, you're requiring larger and larger amounts of computation.

The solution to this problem, we believe, is specialized hardware for machine learning, and that's what's really driven us to develop these multiple generations of Tensor Processing Units, or TPUs. Our first TPUs have been in our data centers since 2015, and you've interacted with them every time you run a search: they're re-ranking the last several hundred links to choose the ten that you're shown. They're an essential part of Google Photos, speech recognition, and many of Google's large-scale applications that you use every day. Last year, we revealed our second-generation Tensor Processing Units, which are now in public beta. Anyone can sign up, any GCP project can start a Cloud TPU, and you can follow the link g.co/cloudtpu to learn more. These TPUs aren't just for training and inference on a single system, though. They're really designed to be connected together into large-scale supercomputers that we call TPU pods, and later this year TPU pods are coming to the cloud as Cloud TPU pods. Here's a full pod of the second-generation TPUs on the screen right now. Finally, this morning you're getting the first glimpse of our third-generation TPUs, assembled into an even larger TPU pod that delivers more than a hundred petaflops of compute for a single machine learning problem. I just want to emphasize that we are committed to making relentless progress in this domain.
This affects all of our products and improves the lives of all of our users, and through the cloud, we're bringing these platforms to you, so you can train machine learning models and run them faster and more cost-effectively than ever before. Just a few notes on performance. You can see TPU v1, with 92 teraops, only doing inference. Then we go to the Cloud TPU, which is 180 teraflops. Suddenly you have floating point, which is a lot easier to work with: you don't have to worry about quantization. And those are connected together into these pods that, as I mentioned before, deliver eleven and a half petaflops and do training and inference on a single problem, or you can slice them into smaller pieces to work on many different problems at once. Just a year later, we're making the leap from about eleven and a half petaflops to more than a hundred. That's more than 8x the performance of a TPU v2 pod in this TPU v3 system, and it comes from a combination of an entirely new chip and wiring those chips together into an even larger-scale system.

So I hope the reason that you're all here is that you're interested in expanding the AI frontier. That's what really motivates us when we're developing these platforms and building applications on top of them. Now I'd really like to focus on what you can do right now with the Cloud TPUs that are available today, and that really starts with performance. A lot of these machine learning models can take days or weeks to train, and it hasn't been uncommon in the past for it to cost thousands of dollars just to do one training run. We're really interested in bringing down those costs and making machine learning acceleration much more widely available. But when we're talking about performance, it's important to be very specific about what we mean, because it's sometimes quite subtle to compare one system to another. First of all, in all of our measurements, we try to focus on real-world data, time to accuracy, and cost. Let me tell you what I mean. When you're training a machine learning model and you're processing millions of images, it's convenient to look at how many samples per second you're processing as a measure of performance, but those samples-per-second numbers don't matter unless you ultimately get where you need to go and reach the correct final accuracy. So what we're really looking to do is, yes, measure how fast the wheels are spinning on the car, but also make sure that the car gets to the right destination and crosses the finish line. It's very important, whenever you're considering building, buying, or renting a machine learning system, to ask about not only steps per second or samples per second, but also the training accuracy and the total time to accomplish the task you really have in mind. We've also put in a lot of energy to make sure that all of these ML benchmarks we talk about are reproducible and open source. Almost all of the numbers I'm about to show you, you can either reproduce today by renting a Cloud TPU, or they'll be open-sourced soon so you'll be able to reproduce them on your own. You don't just have to take my word for it. Now, Stanford recently hosted a competition called DAWNBench; I've got a screenshot of the website here. DAWNBench is the first public benchmark challenge that we've seen really measure performance and accuracy together, which we think is great.
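To make the time-to-accuracy point concrete, here's a small editorial sketch with made-up throughput and epoch numbers (only the ImageNet training-set size is real): a system with faster "wheels" can still lose, or never finish, if it doesn't reach the target accuracy.

```python
# Hypothetical illustration: raw samples/sec can be misleading if a run
# never reaches the target accuracy, or needs more epochs to get there.

def time_to_accuracy_hours(samples_per_sec, dataset_size, epochs_needed,
                           reaches_target):
    """Total wall-clock hours to hit the target accuracy, or None if the
    run plateaus below the target no matter how long it spins."""
    if not reaches_target:
        return None
    return dataset_size * epochs_needed / samples_per_sec / 3600.0

# System A: faster wheels, but this hypothetical run never hits 93% top-5.
a = time_to_accuracy_hours(3000, 1_281_167, 90, reaches_target=False)
# System B: slower per step, but actually crosses the finish line.
b = time_to_accuracy_hours(2000, 1_281_167, 90, reaches_target=True)

assert a is None            # fast wheels, wrong destination
assert round(b, 1) == 16.0  # ~16 hours of useful training
```

Comparing only the 3000 vs. 2000 samples/sec numbers would pick the system that never gets there, which is exactly why we report time to a fixed accuracy instead.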
There was participation from many different companies, individuals, and research groups, and I'm happy to announce that Cloud TPUs and TPU pods did very well across several of the categories. In the ImageNet training cost competition, we have the number-one DAWNBench entry with a model called AmoebaNet. This is a really interesting fusion of the research work we've been doing with these new hardware platforms, because AmoebaNet was an architecture that was actually evolved from scratch on TPUs. That model, if you train it longer, can get to state-of-the-art accuracy on ImageNet, but it can hit the target that DAWNBench set, 93%, in just seven and a half hours for under $50. This really makes this kind of state-of-the-art performance accessible to a much wider audience than ever before. Now, another model that you may have heard of, one that's frequently benchmarked, is ResNet-50, so I just wanted to show a quick case study of ResNet-50 running with TensorFlow 1.8, which is our most recent version. We also submitted this to DAWNBench, and it also reaches this accuracy of 93%. Note that this is top-5 accuracy, and it's not easy to achieve with ResNet; it's easy to fall just a little bit short, and you have to be very careful throughout the implementation to achieve the high accuracy while maintaining this incredible performance of thousands of images per second. This took almost nine hours to train, for $59. But if you want to go faster, part of the reason we designed these systems to be connected together is that just by changing the batch size of the model, with no code changes and no complicated thinking about distributed systems, we can run the same thing on half of one of these TPU pods and hit the same level of accuracy in just 30 minutes. That's nine hours down to 30 minutes, and these are the pods that are coming to the cloud later this year, so you'll be able to run results like this on your own, or train other models with datasets of your own. And if you eliminate the checkpoint overhead that DAWNBench requires us to measure, it's actually just 24 minutes.

What I'd like to mention now is that the field is moving extremely fast. Just as an example, the third-place entry for training cost on DAWNBench was from a nonprofit research organization called fast.ai, and it came in at $72, but it included some really interesting ideas that weren't well known. One was a progressive scaling-up of training image sizes, which lets the model move even faster in the earlier phases of training, when it's just starting to figure out how to complete the task. They also used a much more aggressive learning rate schedule that hadn't been widely used. With Cloud TPUs it was easy for us to make these open changes in our own model and unofficially rerun the experiment, and these two simple changes take our ResNet-50 training cost from $59 down to just $25 with today's on-demand pricing on Google Cloud. There's an open-source reference implementation available; we've built this into our TensorFlow 1.8 ResNet-50, and we'll have more to say about this soon. The fast.ai team deserves the credit for the ResNet improvement, and I'm just delighted, as a researcher and practitioner and now product manager, that so much of this progress is widely shared.
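As an aside, the progressive image resizing idea is easy to sketch as a simple schedule. This is my own illustrative version, with made-up breakpoints and sizes, not fast.ai's actual code:

```python
# Hypothetical sketch of progressive image resizing: train the early
# epochs on small images (cheap, fast), then grow toward full size.
# The breakpoints and sizes here are made up for illustration only.

def image_size_for_epoch(epoch, total_epochs):
    """Return the training image side length for a given epoch."""
    progress = epoch / total_epochs
    if progress < 0.5:
        return 128   # early: the model is learning coarse features
    elif progress < 0.8:
        return 224   # middle: standard resolution
    else:
        return 288   # late: fine-tune at higher resolution

schedule = [image_size_for_epoch(e, 40) for e in range(40)]
assert schedule[0] == 128 and schedule[20] == 224 and schedule[39] == 288
```

Because early steps run on images with far fewer pixels, each of those steps is much cheaper, which is where the training-cost savings come from.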
So you might be wondering: what's the next benchmark challenge, now that DAWNBench has finished? Where do we go next to compare these very different systems, of different sizes and scales and even different model architectures, against one another when we need to decide how to accomplish a goal? A new challenge that was released recently is MLPerf; you can learn more about it at mlperf.org. There are a bunch of researchers from different top universities, and also many companies, coming together to try to define a new, next-generation benchmark. It's even more challenging than DAWNBench and covers many more problem domains. The first deadline is going to be in July. This group is open to new collaborators, and it's also planning to iterate on the definition of the benchmark to make sure that it really tracks where the field is going. I think this is going to be really exciting later this year, and I encourage you to participate and check it out.

As I mentioned before, I just want to provide a little more information about what you can do today with Cloud TPUs. If you want to experience this kind of performance, we've got a whole set of reference models. I'll say more about them in a minute, but they're right there on GitHub at tensorflow/tpu, and we've also got great documentation on the Cloud TPU website at the link below. You don't have to remember the long links at the top; g.co/cloudtpu will take you where you need to go.

So now I'd like to give you a sense of what it's like, hands-on, to use these Cloud TPUs. First of all, one thing that's a little different from other accelerators is that they're network-attached. What that means is you can choose a virtual machine of any size or shape and attach a Cloud TPU to it. The VM can be large if you want, but you would generally start small, because the VM is really playing just a coordinating role, and a few cores are often enough, since all the real computational work is happening on the Cloud TPU. The nice thing about this setup is that the Cloud TPU can be a single device or it can be a slice of one of those pods; you don't have to worry about it, it's all handled for you. Furthermore, you don't have to worry about messing with drivers; you can just use the machine images that we provide. You program these Cloud TPUs with TensorFlow, the most popular open-source framework for machine learning and one of the top projects on GitHub. It's just taken off since it was open-sourced in November of 2015, and we're now up to something like 13 million downloads all around the world. TensorFlow supports many different kinds of machine learning and has many different components. When you're programming TPUs today, you'll focus on layers, estimators, and the fast tf.data input pipelines, and then your code will be transformed behind the scenes by the XLA compiler to talk to the TPU. Estimators and layers really provide flexible, high-level deep learning abstractions for TensorFlow that let you map a wide variety of different models to TPUs. Here's just a quick code sample of a convolutional neural network; this is how you would write it with layers and estimators, with some code elided. What you can see here is that you just have to make some minimal changes to map this over to the TPU, changes to the optimizer and the estimator, and we're working over time to make these changes smaller and smaller, so that it will be really convenient for you to take TensorFlow code that runs on CPUs and GPUs, run it on TPUs, and move back and forth wherever you need to take your code to train or run. One important bit of advice, though, that I'd like to emphasize over and over: high-performance computing is hard on any platform. These chips are so fast that you need to put in a lot of energy to make sure that the whole system is balanced and that every part of it is performing as well as it possibly can. We've done a tremendous amount of this work for you with these reference models that I'm about to show you, which we've open-sourced on GitHub. For example,
here are four different categories of very different applications that are either supported today or are about to be open-sourced for Cloud TPU. In the image recognition column, which is maybe the best known, we've got AmoebaNet and ResNet-50, plus the Inception family: v2, v3, and v4. We can also do object detection with RetinaNet, which is another one of these recent state-of-the-art models, and I'll say more about that in a minute. Also, if you're interested in training a machine learning model and then deploying it at the edge, on a phone or a robot or a drone or in a vehicle or somewhere else, you might want to train a smaller model with something like MobileNet or SqueezeNet, and Cloud TPUs support those as well. But we're not limited to just image recognition and object detection. You can also do machine translation, language modeling, sentiment analysis, and question answering, all with this state-of-the-art model architecture called Transformer, which is very powerful and flexible. It turns out a version of Transformer can also perform fantastic speech recognition, and we just recently open-sourced an example of speech recognition training on Cloud TPUs as part of the tensor2tensor framework; I'll say more about that in a minute. And finally, for those of you who are researchers out there at the cutting edge, where you're generating images from scratch, over in the image generation category we've got Image Transformer and also DCGAN. So we've really got a huge range of applications that are not only possible to make work on Cloud TPUs, but are working today, with open-source code that you can just download. It's a really easy way to get started.

So, image classification is often taking small images like this one and mapping them to a category like "lion," whereas object detection is taking larger images and looking for many different objects in many different places in the image. The cool thing about our RetinaNet implementation is that it lets you process quite large images, 896 by 896, and it trains to a very interesting accuracy on the COCO dataset, 37 average precision, in just six hours for $40. Machine translation training on the WMT dataset gets a BLEU score of 28.6, which is very close to the state of the art, again in just 6.2 hours for $41. And these results are changing all the time as we continue to optimize the platform and the implementations, and as new research comes out, much of which at Google is conducted on TPUs. Now, language modeling: there you're trying to predict the next word. This is an important component of applications like Smart Reply, and again, you can take the language modeling dataset, LM1B, the one-billion-word benchmark, and train to a state-of-the-art perplexity of about 28 for under $500. I've mentioned tensor2tensor again below; that's the open-source project built on top of TensorFlow that has some of the latest and greatest results for all of these sequence-to-sequence models. For speech recognition, we've got this model called ASR Transformer, which trains on LibriSpeech, again to a really interesting word error rate. This is a fantastic way, if you have a large labeled speech dataset, to just train your own near-state-of-the-art speech recognition model from scratch, in whatever language or accent you care about, and you can get to a reasonable accuracy for under $100. Finally, question answering, here with QANet; this is the fast version that we used to win the question answering competition in DAWNBench. It's built around the Stanford Question Answering Dataset (SQuAD), which takes a big chunk of text and then presents a question about some fact that you can infer from reading the text. So this machine learning model is actually looking at a whole paragraph of unstructured text, reading the question, and then pinpointing the answer, which is really impressive. It's really different from image recognition, and all of this is possible on this flexible, powerful new platform for machine learning. Oh, and I forgot that I was going to tell you a little bit more about image generation.
So the metric here is actually bits per dimension, and lower is better. State of the art is around 2.92 right now, but under 3 is really good, and you can get under 3 on this metric in 30 hours for under $200 if you're interested in exploring these state-of-the-art, cutting-edge image generation examples. All of these images were generated from scratch, just dreamed up on the basis of a training set, using this Image Transformer model.

So again, why start with a reference model instead of just opening a code editor and starting from scratch? They're high-performance, they're open source, and they're cutting edge, often collecting input from researchers all across Alphabet. We continuously test them for performance and accuracy, which means that you don't have to worry that some subtle change in TensorFlow or some other part of the stack might reduce the accuracy without you realizing it. You can get up and running really quickly and then modify as needed, because you have the code; it's yours, on your infrastructure, and you can train and run on your own data. But you're perfectly welcome to do it the hard way, with lots of detailed technical advice in our troubleshooting guide, and I'm not kidding that high-performance computing is hard: any part of the system can become a bottleneck. We totally encourage you to take the high-level layers and estimators APIs, build new models, and let us know what works for you.

Just to give you a little sense of what makes high-performance computing so tricky: if you're training with slower hardware, you might find that you can overlap your input preprocessing with your training computation, so your accelerators are always saturated. But when the accelerators get a lot faster, what you find all of a sudden is that you're bottlenecked on input processing and your accelerators are sitting idle. Fortunately, we've provided a bunch of tools connected into TensorBoard, the great visualization framework associated with TensorFlow, to help you debug your performance and understand what's going on. Here I'm showing the steps per second and the loss, but we actually have a very detailed profiler that lets you go in and get op-by-op statistics on what's happening in your model, and we even have high-level recommendations. I don't know if you can read the red text, but it's telling you that in this case your model is not input-bound, so you should spend more time optimizing the step computation. In other cases, it'll tell you that you're input-blocked: don't bother improving the speed of your model, you need a faster input pipeline, and that's where tf.data can help. So we're really doing our best to present these tools and all this documentation for experts, but also to make it really easy to get started with these reference models across a huge range of applications.
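The input-bound versus compute-bound tradeoff just described comes down to a couple of lines of arithmetic. Here's a tiny sketch with purely illustrative timings of my own, not measurements from any real system:

```python
# With a pipelined input stage (what tf.data aims for), each training
# step effectively takes max(input_time, compute_time); the accelerator
# idles for the difference.

def step_time_and_idle(input_ms, compute_ms):
    """Return (effective step time, accelerator idle time) per step."""
    step = max(input_ms, compute_ms)
    accelerator_idle = step - compute_ms
    return step, accelerator_idle

# Slower accelerator: compute dominates, the input stage hides completely.
assert step_time_and_idle(input_ms=30, compute_ms=100) == (100, 0)
# Much faster accelerator: the same input pipeline now sets the pace,
# and the chip sits idle for 21 of every 30 milliseconds (70%).
assert step_time_and_idle(input_ms=30, compute_ms=9) == (30, 21)
```

Speeding up the accelerator by another 10x in the second case would change nothing; only a faster input pipeline would.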
Now, those of you who are used to training and running machine learning models under your desk might wonder: why should I jump to the cloud? That seems like a different platform; how do I think about it? The nice thing about training in the cloud is that you get access to all this top-notch infrastructure, where you can mix and match the different components you need, like those custom VM shapes I was telling you about earlier. You can scale up and down quickly. You don't need to order hardware, maybe build a machine yourself, and sometimes wait months for the hardware to arrive; you can just sign on today and scale up to 50 TPUs if you wish. There's great security throughout Google Cloud; Google is state-of-the-art in security across the entire platform, for enterprises that want cutting-edge machine learning hardware. And you're really able to reduce your capital expenditure with on-demand pricing: you're paying only for what you use. You can train one of these models for 50 bucks, then turn the Cloud TPU off, and you're not paying for it anymore, which is great.

Just a few words from two of our customers for Cloud TPUs: Two Sigma, which is a financial firm, and Lyft, specifically their self-driving vehicle team. Both of them have had great experiences using Cloud TPUs to focus on building their models instead of worrying about massive distributed clusters of other hardware, taking models that used to train for days down to hours, and with these larger and larger pods we're really trying to drive that to minutes.

So let's go under the hood and talk in a little more detail about what's going on in the current generation of Cloud TPU. First of all, the Cloud TPU actually includes a host, all behind the scenes, connected via PCIe. Each Cloud TPU device has these four chips under the heatsinks here, and I'm going to zoom in on one of them. If you look at the Cloud TPU chip layout, you'll see that it's broken down into two cores that look like this. Now, these cores are really built around a systolic array, which is an old idea, but it's newly relevant in this age of machine learning, when so many of the computations really look like gigantic matrix multiplies. We've also got a vector unit and a scalar unit. And in this case there are 8 gigabytes of high-bandwidth memory per core, and there are two cores per chip; the new TPU v3 actually doubles the memory per chip, which is going to let you train even larger and more capable models. Focusing on this matrix unit, the systolic array: one thing that's really interesting about it is that we're actually doing the multiplies with a new floating-point format called bfloat16. I'll say more about that in a minute, but it really delivers great performance for machine learning without a meaningful cost in accuracy. We're still accumulating in float32, though.
So what is bfloat16? You're probably familiar with 32-bit floating point. As you can see here, the orange bits are all devoted to the mantissa, measuring these fine differences between numbers, which is really important if you're doing high-performance computing, detailed simulations, that kind of thing. But in machine learning, it turns out that the range is much more important, and machine learning models are not super sensitive to these fine differences, by and large. The IEEE standard for 16-bit floating point, as you can see here, really cuts down the range of numbers that you can express. One thing that we're finding with TensorFlow is that if you're training with IEEE half precision like this, you really have to pay attention, and you need to get an expert involved on a model-by-model basis to make sure that all of your numbers stay in the right range and don't overflow, and those techniques are difficult to apply in general across a wide range of models. With bfloat16, what we've done instead is save a lot more bits for the exponent, which preserves the range, so in many models this is a drop-in replacement for float32, and it's supported seamlessly in Cloud TPU hardware. Those ResNet-50 results that I was talking about earlier were achieved with bfloat16, and as you noticed, we're still achieving that high accuracy, 76% top-1 or 93% top-5, with a minimum of effort and special-casing on a model-by-model basis.
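To see concretely why keeping the full 8-bit exponent matters, here's a tiny pure-Python emulation of bfloat16 truncation. This is an editorial sketch of the bit layout, not how TPU hardware actually rounds: it simply keeps the top 16 bits of the float32 encoding (sign, 8 exponent bits, 7 mantissa bits).

```python
import struct

def to_bfloat16(x):
    """Truncation-based emulation of bfloat16: keep the high 16 bits of
    the float32 encoding (sign + 8-bit exponent + 7 mantissa bits)."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

# A large activation that would overflow IEEE float16 (max ~65504)
# survives in bfloat16 with only a small relative error...
big = 1.0e38
assert abs(to_bfloat16(big) - big) / big < 0.01
# ...while fine mantissa detail is lost (relative error up to ~2**-8):
assert to_bfloat16(1.0 + 2**-10) == 1.0
```

Same range as float32, much coarser precision: that's exactly the trade machine learning models tolerate well, which is why it can act as a drop-in replacement.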
Now let me show you a brief animation of the systolic array. On this slide it's only 3 by 3, but imagine it's 128 by 128, to get a sense of how different this is from an ordinary central processing unit and how much parallelism there is, which is where we're getting this fantastic performance. Here you have data streaming in, and what's happening, as the systolic array beats on every step, is that you're getting great reuse of these intermediate results as the data flows through the systolic array and generates the outputs. So this is super high throughput for matrix multiplication, and it's tightly integrated with the other parts of the TPU that support all the other custom operations. Our compiler also does a great job of taking high-level mathematical expressions and mapping them down into this coordinated dance of low-level operations, which gives us this tremendous amount of speed.
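The systolic dataflow described above can be modeled, very loosely, in a few lines. This is my own output-stationary toy model; the real 128-by-128 array pipelines and skews the data far more cleverly, but the idea that every cell does one multiply-accumulate per beat is the same:

```python
def systolic_matmul(A, B):
    """Toy model of an n-by-n output-stationary systolic array:
    C[i][j] is a stationary accumulator, and on every beat one
    wavefront of A (from the left) and B (from above) flows past it."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for beat in range(n):          # one beat per wavefront
        for i in range(n):
            for j in range(n):
                # the multiply-accumulate performed by the cell at (i, j)
                C[i][j] += A[i][beat] * B[beat][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]
```

On a 128-by-128 array, all 16,384 cells fire on every beat, which is where the enormous throughput advantage over a few general-purpose cores comes from.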
So where do we go from here? Well, first of all, I really want to emphasize something I haven't called out explicitly, about scaling up. I think the days of single systems are over, because what you ultimately care about is how fast you get to your result, right? What we're trying to do with these TPU pods, and by making them widely available in the cloud, is to give you a dial that you can turn: you do your prototyping on an inexpensive single device, then you increase the batch size without any other code changes, and suddenly your training time is going down from hours to minutes. Now I'll show you some real measurements from our systems to give you a sense of how that's possible. First, what I'm demonstrating here is a pod of TPU devices; the second-generation TPU pod has 64 of these boards in it, and you can see linear scaling of images per second, up and to the right. And you might say: but wait, what about the accuracy? You can check our DAWNBench results; that 30-minute result also confirms that we can still get the high accuracy with this scale-out. What you can see is that the interconnect in this pod is so fast that you don't have to struggle the way you would with mixing and matching hardware on your own, whether on-premise or in the cloud. You can just turn this dial, with no code changes, and benefit from the super-fast proprietary network.

There's a more subtle point here that's a mathematical issue, which is that until recently, it hasn't been clear how to take these machine learning models and increase the batch size, meaning how many of these images you're processing simultaneously in each step while you're training. There's been a huge amount of research energy on this recently, and there have been a lot of breakthroughs, so we're now able to train effectively on much larger batch sizes than ever before. That's great news for scaling up with TPU pods. Let me help you interpret this graph. What you see on the horizontal axis here is time, and what you see on the vertical axis is accuracy. You might imagine that your first implementation, on the pod slice that you're renting today, gets to, let's say, 76% accuracy on ResNet-50, and that takes a certain amount of time; that's the orange curve out there. But with no code changes, just by changing the batch size in the right way with the models that we've already open-sourced, you can pull that time in as you use larger and larger sections of the pod, until you get down to these 30-minute results, or possibly even better. So we think this is an extremely promising research direction, and we're doing our best in all these reference implementations to make sure that it's immediately accessible to you. Once these pods become more widely available in the cloud, you'll be able to just transparently benefit from this large-batch training. I'd love to live in a world where you can use an exaflops for a minute, then let it go and think for a while, while somebody else is using it, and just swap back and forth.

Here's another view of the full TPU v3 pod that Sundar showed you this morning. We are doing everything that we can to push the boundaries of performance, and that's what's really led us to liquid-cool this pod, in contrast to everything that we've done previously, which was air-cooled. If you want a close-up view, here it is: just one of the boards in that pod looks sort of like this, and you can see the tubing; those cold plates are now on each one of these next-generation TPU v3 processors. So we're going to go to the largest batch sizes we can, we're publishing all this research that we're doing, and we're open-sourcing as much of it as we can on these Cloud TPU platforms that you can access today. We're really excited to go with you into this exciting AI frontier. You can get started right now with these high-performance reference models and tutorials, and please reach out to us and let us know; we'd be really interested in hearing your feedback on the Cloud TPU products available today, and we'd love to know where you'd like to go. We hope you enjoy trying all these different applications across image recognition, speech recognition, image generation, and machine translation, and I'd be happy to take questions.