Speaker: a product manager with 15+ years of diversified technology experience, currently working on cloud, GPUs, machine learning, containers, and Kubernetes.
About the talk
Learn about our new Accelerator Optimized GCE VM Family, featuring NVIDIA’s new A100 GPUs.
This session provides both high level information about the new offering as well as detailed information about the hardware used, the VM capacities and VM shapes available. Learn how this hardware can be used for ML and HPC applications to drive innovation and improve performance.
Watch to learn ways to easily get started, scale your workloads and operate the infrastructure.
Speaker: Chris Kleban
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
Subscribe to the GCP Channel → https://goo.gle/GCP
Hi, welcome to today's session, where we'll talk about the new Accelerator-Optimized VM family targeting machine learning and HPC workloads. My name is Chris Kleban. Whether you use our infrastructure-as-a-service or our managed service offerings, the key to success is the right infrastructure: performance, total cost of ownership, reliability, security, and so on and so forth. To that end, we want to make sure the best GPU compute is enabled on Google Cloud. Not only that, we want to make it easy to get started, easy to scale, and easy to optimize performance. And it's not just the infrastructure; it's all about your workload, whether that's machine learning or rendering or HPC or visualization. We want to make sure we have the right products to fit your needs. We'll talk a little bit about that, and about how committed we are to a fast product cycle.
Last year, we were very proud to be the first major cloud to offer the T4. This is a very low-cost compute option with great performance for inference, visualization, or a lot of scale-out compute. Every day I'm staggered by the growth of that product line. One customer runs large-scale simulations using the GPUs to ensure that self-driving cars are safe on the road. On the other side, let's talk about the best performance. We've offered many GPU generations historically, and as of last month I'm super proud to say you can come and get the A100 today. We go up to 16 GPUs in a single VM, and we'll talk a little bit about what's special about the server hardware involved.
Google Compute Engine has historically been all about customization: you pick the right vCPU count, the right memory, the right storage, and optionally what type of GPU to attach; it's very customizable, and you can continue to do that with our general-purpose families at the right low cost. But now we're adding VM families, like the A2, targeting specific workloads, where a special hardware configuration is what the workload needs to succeed. Let's talk a little bit more about that.
Let's talk about the A2 VM family, starting with its key component: the new NVIDIA A100 GPU. This is a generational leap in performance. For FP64 precision there's about a 2x speed improvement; this is used for scientific computing and HPC workloads where you really need high accuracy. TF32 is kind of the new de facto precision for ML training: it works out of the box, you just turn it on and it goes without adding optimizations, and with the A100's Ampere tensor core technology it really shines, with a 10x speed improvement, or 20x with the optional sparsity feature. FP16 again gives roughly a 2.5x increase in performance, or 10x to 20x versus the previous generation's standard precision, without a corresponding price increase on the A100, so it's going to have the best TCO for GPU-powered compute. A little more on the TF32 tensor cores: a lot of code just points at FP32, and out of the gate you can switch it over to TF32. A lot of time has gone into making sure that, for a lot of workloads, the reduced precision doesn't matter for the result. There are four different precision modes for ML training that you can use here, and it's very flexible: take the one that fits your work.
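To make the TF32 idea concrete, here is a minimal Python sketch of my own (not anything NVIDIA or Google ships): TF32 keeps float32's 8-bit exponent but only 10 explicit mantissa bits, so this helper emulates it by zeroing the 13 low mantissa bits of a float32 value (real hardware rounds rather than truncates, so this is a simplification):

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 precision: keep float32's exponent, but only
    10 of its 23 mantissa bits (truncating, for simplicity)."""
    # reinterpret the float32 bits as an unsigned 32-bit integer
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 least-significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The relative error is bounded by ~2^-10, which is why most training
# workloads can switch from FP32 to TF32 without noticing.
print(to_tf32(1.2345678))
```

The takeaway: the dynamic range (exponent) is untouched, and only the last few digits of precision change, which matrix-heavy ML math tolerates well.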
Now let's talk about a new feature on the A100 that I'm really excited about: multi-instance GPU. On a single GPU you can create partitions, up to seven of them, and not only can each run its own workload, the partitions come with isolation and performance guarantees, so you don't have to worry about one app running into another. Additionally, a very common enterprise-focused buying mechanism is to commit to long-term usage with one- or three-year committed use discounts. But there's some risk: I already told you we're going to launch the best GPU hardware there is, so what if we launch new hardware before your commitment is up? This multi-instance GPU feature in some ways future-proofs your investment. You might run a single ML training workload on full GPUs, on one GPU or sixteen, for years. But maybe your business changes and you want to run inference, because you've done the research and the training already; so you switch to using multiple partitions. It future-proofs your usage and also increases the overall efficiency of your fleet.
Next, sparsity. Now, I'm not a data scientist, so I'm just going to give you a high-level overview. It turns out you don't need to process every single data point and every single calculation in your neural network. By just not doing some of them, in a way that doesn't impact the accuracy of the model, you can get a 2x speed improvement. NVIDIA includes software and features that make this easy to implement yourself. So altogether, a lot of improvements in the A100 GPU.
Let's pivot and talk about our A2 VM family. Again, the A100 has 40 gigabytes of GPU memory, which is really needed to keep the GPU fed; it's about the largest GPU memory in this line of products, the V100 and P100 models, to date. We offer 16 of these GPUs in a single VM, which means you can get a staggering 640 GB of GPU memory. On top of that, it's attached to a scalable CPU platform from Intel with 96 vCPUs and 1.3 terabytes of memory. This is really important because some customers come to us, whether they're in financial services or ML or science, and say they simply have too big a data set: their GPU power needs might be variable, but they need a VM with big memory. So one, the memory size is much larger than past products. Two, we kept a roughly 2x ratio: 640 GB of GPU memory, and about 2x that, 1.3 TB, of VM memory. That's because we want to allow for pre-processing and pumping data into the GPU at high performance for the actual workload. Google has a lot of AI experts, and we worked very closely with those people to ensure optimal performance in this configuration.
We didn't stop there. We took NVIDIA's NVLink 3 technology to make GPU-to-GPU communication ultra-fast; we'll talk a little more about that shortly. For storage, you can use network-attached storage or bring in-server local SSD storage for fast scratch. On virtualization, we spent a lot of time asking: how do we give you consistent performance, how do we optimize performance, and how do we make it transparent so your application can take full advantage of the GPUs? We made a lot of improvements on scheduling to get those characteristics for you.
Another great area of work was the network stack, where we spent a lot of time. You'll see that the family comes with two flavors, the high-GPU flavor and the mega-GPU flavor. The mega-GPU flavor uses 96 vCPUs and also comes with 100 Gbps of network performance. If you want to scale out, the high-GPU flavor is a nice flavor to have; but for the customers who want the ultimate GPU parallel performance, there's the 16-GPU mega-GPU flavor with fast networking and local SSD. On both of these flavors, for scale-out workloads, we spent a lot of time optimizing the NCCL library. If you take the open-source, out-of-the-box library that enables VM-to-VM communication and compare it with our optimized version, there's up to a 70% improvement in these network tests. When you're scaling your workload from one VM to two VMs to ten VMs, that's really important. Let's go a little deeper on the A2 mega-GPU VM instance.
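To make the role of that collective-communication library concrete, here is a pure-Python simulation (a sketch under my own assumptions, not Google's optimized NCCL code) of the classic ring all-reduce pattern such libraries implement: each of N workers passes vector chunks around a ring for 2·(N−1) steps, and every worker ends up holding the summed gradients while each link only ever carries 1/N of the data per step:

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce over N workers.

    grads: one gradient vector per worker. Returns the per-worker
    vectors after the collective; each equals the elementwise sum.
    """
    n = len(grads)
    m = len(grads[0])
    assert m % n == 0, "vector length must divide evenly into N chunks"
    csz = m // n
    vecs = [list(g) for g in grads]
    chunk = lambda v, c: v[c * csz:(c + 1) * csz]

    # Reduce-scatter: n-1 steps; afterwards worker (c+n-1) % n
    # holds the complete sum for chunk c.
    for s in range(n - 1):
        sends = [((i + 1) % n, (i - s) % n, chunk(vecs[i], (i - s) % n))
                 for i in range(n)]          # snapshot before mutating
        for dst, c, data in sends:
            for k in range(csz):
                vecs[dst][c * csz + k] += data[k]

    # All-gather: n-1 steps; each worker forwards its fully
    # reduced chunk around the ring.
    for s in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - s) % n, chunk(vecs[i], (i + 1 - s) % n))
                 for i in range(n)]
        for dst, c, data in sends:
            vecs[dst][c * csz:(c + 1) * csz] = data
    return vecs

print(ring_allreduce([[1, 2], [3, 4]]))  # both workers end with [4, 6]
```

With real NCCL the same pattern runs over NVLink inside the VM and the 100 Gbps network between VMs, which is why tuning that path matters so much for scale-out training.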
NVIDIA delivers these HGX boards: there are eight A100 GPUs per board, and a high-speed NVLink fabric for GPU-to-GPU communication on a single HGX. We've worked closely with NVIDIA to provide a server platform that combines two of these boards with a very tightly coupled NVLink fabric. Now you have 16 GPUs, all-to-all connected, with 9.6 TB per second of aggregate bandwidth. Why that's important: it allows a couple of things. One, it allows you to use the 640 GB of GPU memory as a unified memory fabric through support of the CUDA tool sets. Also, it really does nearly linearly scale performance before you even get to scale-out; I talked about the per-GPU performance, and when you have multiples of that, it makes BERT training go very fast. It also works very well for a lot of HPC workloads. In terms of performance, that comes to over 10 petaflops of FP16 in a single VM. Maybe these numbers mean something to you; in short, the performance is really good.
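The headline figures above are simple multiples of NVIDIA's published per-A100 numbers; a quick back-of-the-envelope check (the per-GPU constants are NVIDIA's spec-sheet values, not my own measurements):

```python
# a2-megagpu-16g back-of-the-envelope, from NVIDIA's published A100 specs
GPUS = 16
HBM_PER_GPU_GB = 40        # HBM2 memory per A100
FP16_TFLOPS_DENSE = 312    # FP16 Tensor Core throughput, dense
FP16_TFLOPS_SPARSE = 624   # with the 2:4 structured-sparsity feature

total_hbm_gb = GPUS * HBM_PER_GPU_GB
fp16_pflops_dense = GPUS * FP16_TFLOPS_DENSE / 1000
fp16_pflops_sparse = GPUS * FP16_TFLOPS_SPARSE / 1000

print(f"GPU memory:  {total_hbm_gb} GB")              # the 640 GB unified pool
print(f"FP16 dense:  {fp16_pflops_dense:.1f} PFLOPS")
print(f"FP16 sparse: {fp16_pflops_sparse:.1f} PFLOPS")  # the ~10 petaflops figure
```

So the "over 10 petaflops of FP16" claim lines up with the sparse tensor-core numbers across all 16 GPUs.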
And it's really good TCO: for the same amount of budget you can get a lot more work done, which means a lot more innovation, enabling you to focus on the things that matter. In terms of applications, NVIDIA has spent a lot of time working on how to enable more and more compute. In our GCP Marketplace, there are NVIDIA GPU pre-packaged containers with tons of applications that are ready to go. You don't need to worry about installing the drivers or the applications; just get the container and run it on Google Cloud, on the A2 or any of our GPUs.
We've been talking a lot about the bottom of the stack, the infrastructure layer, where we're committed to providing the best infrastructure no matter what, no matter your choice of hardware. Moving up the stack: if you don't want to worry about the infrastructure on GCE or Google Kubernetes Engine, you can do ML training and serve predictions as a managed service; just have your data scientists use the notebook tools, powered by the A2 VMs we talked about. Going up the stack one more layer, there's AutoML. This is where you can create models for your business based on your data: no PyTorch, no TensorFlow; you bring your data and it does the work for you. Or there's a bunch of APIs that let developers quickly, in the development language of their choice, use a series of pre-trained models, things like Vision and Translation, to get started super quick.
Now let's talk a little bit more about performance improvements. I talked before about how many operations per second the hardware can do, but really what matters is your workload.
For ML training, one of the most demanding applications, language models are very popular right now, and you can see a 3x to 6x improvement over the previous generation. Our customers are also reporting, using FP16 in real-world scenarios, results very close to that 3x improvement; I believe I saw 2.8x last I checked. These numbers are provided by NVIDIA, but we are seeing similar things from customers. For inference, again, the A100 is a big step forward over the V100; the T4 is still a great product, but sometimes you want a little bit more bandwidth for inference. For batch workloads, you can use seven of the GPU partitions to get a really large amount of inference work done. And perhaps you're doing reinforcement learning, which uses a model to run game simulations while training a model at the same time; both of these are really important for these large workloads. So hopefully it should be very clear why and how the A2 VM on GCE is going to be high performance, and while we haven't announced pricing yet, trust me, it's going to be the best TCO on Google Cloud.
Let's talk a little bit about GPU-accelerated Apache Spark. Spark is a framework and tool set that a lot of data scientists and analysts use. In the old world, you'd pull a bunch of data into a cluster, process that data, maybe reformat it or remove unneeded information, and put it in storage; then you'd have to have another cluster with a different hardware configuration to do your ML model training or HPC job. NVIDIA and Google worked together to help speed that up by bringing GPU power to both model training and data preparation: Spark 3 supports both of these through RAPIDS, and now our managed service, Dataproc, supports Spark 3 clusters and our GPU VM product offerings from an infrastructure perspective. I'm really looking forward to seeing how customers use the really large A100 VMs for data preparation and model training.
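As a sketch of what attaching GPUs to a Dataproc cluster's workers looks like from the CLI (accelerator type, image version, and names here are my illustrative assumptions, not from the talk; verify flags against the current `gcloud dataproc` documentation before running):

```shell
# Hypothetical example: a Spark 3 (image 2.0) Dataproc cluster with one
# GPU per worker node. All names and values are illustrative only.
gcloud dataproc clusters create rapids-demo \
    --region us-central1 \
    --image-version 2.0 \
    --num-workers 2 \
    --worker-accelerator type=nvidia-tesla-t4,count=1
```

The RAPIDS libraries are then layered onto the cluster so Spark SQL and ML stages can run on the attached GPUs.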
We've talked a lot about the A100 and about AI, and we're coming to a close on performance. It's the newest thing and in very high demand; the alpha is available today, and there's a sign-up form, so please sign up. But it's not the only thing we have: there are five other GPUs covering a wide range of uses, whether it be ML training, inference, rendering, compute, or visualization. We have all of these, and you're welcome to use any of them. I want to talk about the T4 a little bit. Since the launch, we've noticed there's very high demand for a VM with maybe a variable amount of CPU and memory and some GPU power, and because of that I've seen a lot of growth, more growth than you'd expect: social media companies, self-driving-car simulations, and more. I want to talk a little bit more about inference, or predictions per second. The TCO comes from things like the T4 itself, plus our spread of regional availability, plus our network. Ultra-low latency is really important: a lot of user experiences have predictions and AI built into them, whether you're interacting with AR features or getting the recommendations you want.
If you're getting started with the T4 as your AI inference building block: it delivers ultra-low latency for inference, as low as about 1 millisecond. You combine that with our global availability, spread around the world so we can be close to the largest user bases by population. And the third major component of this is our network. Networking is my own background and something I'm very familiar with, and there are two things that really surprised me about Google's network and why we invest in it so heavily. The first is the fiber optic cable plant, built for ultra-low latency: how do you get Google Search to your web browser as fast as we can, or stream video? That network, optimized for exactly that, is what powers Google Cloud. And one feature specifically: cold-potato routing. In essence, instead of handing traffic off to the public internet as quickly as possible, regardless of packet loss or other issues, we keep traffic on our own fiber network as long as possible before handing it off near the end user. So the combination of a great network and very high-quality infrastructure is great for real-time inference work.
Let's talk about preemptible GPUs. Preemptible VMs, in essence, are our spare capacity, offered to you at a deep cost savings. There are a lot of reasons why we have spare capacity: maybe we're building ahead of demand, or other users aren't running their workloads at that moment. Either way, there are moments when we have spare machines, and you can use them. One customer story: processing hundreds of thousands of music files at around 70% savings. The catch is that you can't always guarantee the capacity will be there, so you have to have flexible completion times; in this case it worked out. Please go check out the customer story, and thank you to that customer; I really appreciate it. To that end, I talked about not only having the best GPUs but making them easy to use and easy to scale. The previous example was on Kubernetes, but perhaps you use VMs and want something that lets you scale capacity up and down. We have a new feature that helps you find spare capacity, called distribution shape "any". Instead of worrying about how much capacity exists where, or working with our capacity team, you create a managed instance group, set your target region and your target VM count, and allow us to find that capacity for you. It's specifically useful for preemptible capacity, and I'm excited to see the usage of this so far.
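A hedged sketch of what using that capacity-finding feature might look like (the group and template names are made up, and the flag spelling is my assumption from the managed-instance-group docs; verify against current `gcloud` documentation):

```shell
# Hypothetical example: a regional managed instance group that lets Google
# place VMs in whichever zones currently have capacity.
gcloud compute instance-groups managed create my-gpu-group \
    --region us-central1 \
    --template my-a2-template \
    --size 8 \
    --target-distribution-shape any
```

The idea is that you declare the target count and region, and the platform hunts for capacity across zones on your behalf.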
Let's talk about another exciting way to make it easy to get started and to get high performance: the Deep Learning VM images. When I first started as the product manager for GPUs a few years ago, I did what normal product managers do: tried the product and listened to users. Way back then, I was surprised to find how easy it was to use the infrastructure we created, and how hard it was to set up the software: PyTorch versus TensorFlow, a series of applications and packages. The Deep Learning VM images, and containers, package all of that up, and not only that, they include the NCCL library I talked about before. Easy to get started, easy to scale up.
Let's talk a little bit about TensorFlow Enterprise. For customers who perhaps want enterprise-grade support, or just the cloud-scale performance optimizations, these DLVM images or containers we talked about are there for you. Not only that, we've worked hard to get A100 support into both TF1 and TF2; I'm really excited to see that, and our alpha customers are already running it. To wrap up today's talk, let me show you how simple it is to get started with all these things. Say I have a TensorFlow job to run from a notebook: I spin up a GCE VM, pick the new A2 VM family sized at 1 GPU, pick the container that has everything I need with the performance-optimized software and TensorFlow stack, connect the Jupyter notebook, and I've gotten started. I want to thank you for listening today and learning about the A2 VM family. I encourage you to sign up and get started right away; the alpha is available now, and public availability will come soon. Thank you, and have a good day.
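To recap that wrap-up demo in concrete terms, the VM-creation step boils down to something like the following (machine type, image family, and zone are illustrative assumptions on my part; check current GCE and Deep Learning VM documentation for exact names, especially since the A2 family was in alpha at the time of this talk):

```shell
# Hypothetical example: a one-GPU A2 VM using a Deep Learning image,
# which bundles drivers and frameworks. Names and zone are illustrative.
gcloud compute instances create my-a100-notebook \
    --zone us-central1-a \
    --machine-type a2-highgpu-1g \
    --image-family common-cu110 \
    --image-project deeplearning-platform-release \
    --maintenance-policy TERMINATE
```

From there, connecting a Jupyter notebook to the instance gives you the workflow described above: pick the family, pick the container, and start training.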