Duration 27:55

Ensuring business continuity at times of uncertainty and digital-only business with GKE

Kobi Magnezi
Product Management at Google
Google Cloud Next 2020
July 14, 2020, Online, San Francisco, CA, USA

About speakers

Kobi Magnezi
Product Management at Google
Donovan Carter
Sr. Devops Engineer at Dexcom

Seasoned Product and Technology Management Executive with years of successful experience in digital channels and advanced technology. Business Development experience with industry leaders in the market. Successful partnering and execution of global projects and contracts. Tech-savvy and visionary thinker. Experienced with Cloud Computing, Big Data, Financial Technology, Payments, and Mobile Solutions.

I'm passionate about enhancing the developer experience and providing insight and perspective from an infrastructure and back end perspective. DevOps is a passion, not a role, to me.

About the talk

Join to learn how to achieve high availability and reliability for applications deployed on Kubernetes clusters. Listen to recommendations and best practices that help you manage and weather all kinds of changes. These are especially worth considering now, given that many people are quickly transitioning businesses into a digital-only world to reduce the impact of COVID-19.

Speakers: Kobi Magnezi, Donovan Carter

Watch more:

Google Cloud Next ’20: OnAir → https://goo.gle/next2020

Subscribe to the GCP Channel → https://goo.gle/GCP

#GoogleCloudNext

APP311

Hello, and thank you for joining us today. My name is Kobi Magnezi. I'm a product manager for GKE, and I have the pleasure of having here with me today Donovan Carter, senior DevOps engineer at Dexcom. It's a pleasure to be here. My name is Donovan Carter, and I'm a senior DevOps engineer with Dexcom. I've been working at Dexcom in the DevOps team, on our platform team, for the last three years, and it's been an honor to get to know Kobi through our mutual interest in Google Kubernetes Engine.

Today we'll discuss high availability and business continuity with GKE. But first, let's define what business continuity means. It means having a proper plan, preparation, and mitigation strategy in place, so that if something happens with the potential to disrupt your business, your organization will have the ability to sustain its critical applications. We'll break this discussion into two parts, day zero and day two, and for each one we'll provide some key recommendations. And my task in this is to discuss how Dexcom leverages this business continuity approach.

For those of you who may not be aware, Dexcom is an industry leader in the continuous glucose monitoring space. What that means is that for people with type 1 diabetes, we provide a platform of services that they can use to keep in touch with their own disease, or the disease of a loved one, by monitoring their blood glucose levels. Because of that, we have chosen Kubernetes, and Google Kubernetes Engine specifically. The needs of our customers being what they are, we have to provide that service to them in a highly available way, and so the obvious choice was to go with Google

Kubernetes Engine. Can you share more, Donovan, on why you decided to use a managed service instead of running your own Kubernetes, installed in your own environment? Absolutely. It comes from experience. We were using another cloud provider prior to our migration to Google a few years ago, and on that cloud provider we had made the decision to essentially deploy our own Kubernetes service within the cloud. What we found out is that it's really hard to do that. It requires a lot of operations, and we're going to get into that today. And we decided not to invest in

the monitoring, the alerting, all the things that go into running your own. Google presented itself as not only an industry leader and a contributor to Kubernetes, but also as offering it as a managed service, which takes away many of the things that we had to deal with on a day-to-day basis. That was very, very attractive, and so far we've been really, really excited about it. The other added bonus is that it scales ridiculously well, and we don't have to do anything. We don't really have to think about it. It's been a life-changer for us to go from where we

were several years ago to where we are now, and it's exciting to see where we're going to go in the future. Awesome. Let's start with day zero. When thinking about the high availability of a GKE cluster, it is important to carefully plan the topology and configuration of your GKE clusters before you deploy your workloads. Let's look quickly at the GKE architecture and its main components. When you provision your Kubernetes cluster, GKE provisions the machines, the underlying GCE resources that are required for the control plane and the nodes. We also manage the rollout of the

Kubernetes versions and patches and their availability in GKE. The GKE team manages the control plane, including all the components running on it, such as the API server, etcd, and others. We also manage the software that runs on the nodes, which includes the operating system image, the kubelet, and others. When you create a cluster, you can decide on the right topology for your business. You can choose to create a zonal cluster, which includes one replica of your control plane, and then you can decide whether you want to have your nodes spread across three different zones or whether you want

to keep them all in one zone with the control plane. That really depends on your needs. Another option, suited to production environments, is to create a regional cluster, for which GKE provisions three machines for the control plane, and the node pools also span three availability zones. The control plane redundancy provides a highly available control plane, which in turn offers a higher SLA for the Kubernetes API. Even though your business applications, those that run on the nodes, will continue to run and serve traffic when the control plane is down temporarily, for example for maintenance, it is

recommended to use regional clusters for highly available workloads. The control plane is responsible for observing, scaling, and repairing your cluster, so when higher availability is required, regional clusters should be preferred. Whichever topology you end up using, GKE offers a fully managed experience for the control plane. That means we take care of backups and of repairing the control plane if something happens. We make sure it's available through monitoring, health checks, and auto-repair. We also keep you up to date: we upgrade your control planes to the latest Kubernetes version and make every patch and security

patch available in a reliable manner. And most importantly, you get Google SREs to monitor and manage your infrastructure for you. Nodes are a shared responsibility. That means that while the control plane is hosted in a Google project, the nodes run in your project, the customer project. That being said, GKE offers a variety of capabilities that automate lifecycle management of the nodes as well, and we'll cover those next. Once you have decided on the cluster topology, it is important to consider the scaling strategy. Why is scalability important for continuity? Scaling is

necessary to ensure that as you grow, we match your needs, so we can grow the number of pods or the number of nodes based on demand. GKE includes several autoscaling capabilities that allow it to automatically resize your cluster based on your configuration and on workload signals. Donovan, I would like to hear how Dexcom is using scaling as well. Absolutely. It's been a tremendous experience. Part of that comes with design and planning for the scaling of resources, and being a

part of a cloud provider like Google really, really helps with that. With node autoscaling, that problem is solved: it brings up new nodes whenever we need additional capacity. We're also able to leverage horizontal pod autoscaling to serve our customers during peak utilization hours. Being a worldwide company, that can be at any time of the day, depending on what region we're looking at, so having a person handle it just doesn't make sense. We've also seen the opposite: when we're at our valley of utilization, we're able to save money, as well as resources for Google, by scaling down

low-pressure services and not using those resources, which has been wonderful. Autoscaling really helps you scale up and down to meet your demand, but sometimes you do need to do some capacity planning, especially when you expect big demand seasons like Black Friday, Cyber Monday, or New Year. In those periods, you may want to reserve GCE capacity, so that when you need that capacity, for example for scaling, repairing, or upgrading, you'll have a reserved pool of resources available for the cluster to consume. Basically, GKE

reservation is an integration between GKE and GCE zonal reservations. You can create a specific or a nonspecific reservation, and GKE will create new resources from the pool reserved for you. There's an example here on the right side where we create a reservation of three machines, we verify the reservation, and then we create a cluster that consumes any matching reservation that we created. Now that we've covered the provisioning and scaling of nodes, let's talk about the applications, your business applications, those that run on the nodes. Different applications have different behaviors and needs: some

applications are stateless and some are stateful. More than 50% of GKE customers run stateful workloads on Kubernetes successfully. It's not an easy task, but luckily Kubernetes and GKE offer configuration and tools to support different types of applications. Do you guys run any stateful applications on GKE? We do. Some things make it difficult [inaudible], and I think we're going to get into those here in just a second. Exactly. And when you run different types of applications, you

also need to think about disruption and how much disruption you can tolerate. A pod disruption budget, or PDB, allows you to state your disruption tolerance. In other words, if your application is replicated, you can specify how many replicas you are willing to sacrifice when voluntary events like upgrades happen. In the example on the right side here, we have specified that we cannot tolerate fewer than two replicas of our application. This is just another way to say max unavailable equals one, if you have three replicas and you want only one to be unavailable. GKE respects PDBs

for up to 60 minutes, and also the termination grace period for up to 60 minutes: two hours in total. As I mentioned, a PDB covers voluntary events like upgrades or repairs. As you can imagine, if there is a failure of the physical machine, for example, a PDB clearly won't be able to mitigate that. The other thing you want to know about your workloads is whether you would benefit from co-locating some of your pods on the same machine, or whether, if you do co-locate some pods on the same machine, you run into risk.

The scheduler automatically spreads your pods across nodes. However, sometimes, for example with a stateful system such as ZooKeeper, a quorum of servers is required to successfully commit mutations to data, and by default the pods of a StatefulSet may end up on the same node. For business continuity, we want to make sure that co-located pods don't present the risk of a single point of failure, right? So if we have a StatefulSet, we can use pod anti-affinity, for example, to make sure that Kubernetes doesn't schedule more replicas on

the same machine, so each replica gets its own machine. On the flip side, some applications may benefit from actually being co-located. For example, if you have a web server and a cache service, you want them to be on the same machine for better performance and lower latency. Are you guys using any pod affinity or anti-affinity with your applications? Yeah, it's exactly like you said, Kobi. We want to make sure our sensitive workloads are not exposed to single points of failure, StatefulSets specifically. But we also use preemptible nodes, and as you know, a preemptible node only lasts for a max

of 24 hours. We want to make sure that our highly available workloads are truly highly available and not subject to single points of failure, and that's a free bonus way to find out whether your workload is really susceptible to that or not. Right. And before we move on to cover day two, there is one more important aspect of business continuity: having accurate signals, so that Kubernetes knows exactly how your workloads should behave and can monitor the health and state of your workloads. The first one is the readiness probe. If your application,

for example, takes some time to start and initialize, you don't want us to start sending customer traffic to that workload until it's fully ready to serve traffic. You do so by setting up the readiness probe. The other signal is liveness: is your workload healthy and alive, or does it require a repair? Setting these probes can provide better signals to Kubernetes and also avoid false positives in which Kubernetes tries to repair unnecessarily. So now that we've set up the cluster, let's cover several recommendations for day two.
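The difference between the two probes can be sketched from the application's side. The following is a minimal illustration, not anything shown in the talk: the paths `/healthz` and `/ready`, the `AppState` fields, and the port are all assumptions chosen for the example. It relies only on the fact that Kubernetes treats an HTTP probe response in the 200 to 399 range as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    """Illustrative process state; both fields are hypothetical."""
    initialized: bool = False  # flipped to True once startup work is done
    healthy: bool = True       # flipped to False when only a restart can help


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe request on `path` should get."""
    if path == "/healthz":  # liveness: should the container be restarted?
        return 200 if state.healthy else 503
    if path == "/ready":    # readiness: may traffic be routed here yet?
        return 200 if (state.healthy and state.initialized) else 503
    return 404


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve probe endpoints; blocks forever, so call it from the entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

A freshly started process in this sketch answers 200 on `/healthz` but 503 on `/ready` until `STATE.initialized` is set, which is the "alive but not yet serving" case described above.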

One of the most important things you need to carefully plan for day two is your upgrade strategy, and specifically a plan to continuously upgrade your cluster. That's because a minor version is released roughly every three months from open source, and patch versions may be released every month. Both the control plane and the nodes need upgrading; those are basically the two places to manage when it comes to managing versions. There are many reasons why it's crucial to keep your cluster up to date. Among others is support, right? Supported versions get regular updates and security

patches, and fixes are oftentimes cherry-picked from one version to another. With a new minor version every three months and support reaching two minor versions back, that translates to roughly supporting each minor version for about nine months. One thing to call out is that open source announced that Kubernetes 1.19 will be the first release to officially receive a year of patch support, effectively prolonging support from 9 months to 12 months for a Kubernetes minor version. The other requirement is to maintain

a supported version skew between your control plane and the nodes. Nodes cannot run a newer version than the control plane, and they have to be within two minor versions of the control plane. So, for example, if your control plane runs on 1.19, your node pools can run 1.19, 1.18, or 1.17, the currently supported versions. Since GKE automatically upgrades the control planes in the fleet, it is recommended to use auto-upgrade, or at least to have a plan to manually upgrade nodes on a regular basis. This way, you'll avoid getting into an unsupported version-skew situation with the upgraded

control plane, or having to manage your own upgrade workflow. You also want to make sure that your upgrade strategy supports any compliance regime that your workload may be subject to: PCI DSS, SOC 2, or HIPAA, which Dexcom is following as a medical devices company. Each one of these standards actually has a specific section that speaks about how fast you should consume critical patches. So you definitely want to take a look into these standards and make sure that your upgrade strategy complies with them. In GKE, we offer a fully automated process to upgrade both

control plane and nodes. Control planes are managed and owned by Google, and we initiate their upgrade as part of our GKE release. Your business applications, the workloads running on the nodes, stay up and running when we upgrade the control plane. During a zonal control plane upgrade, you are unable to change the cluster's configuration for several minutes, because we are taking the control plane down while bringing up a new one. So when we upgrade a zonal control plane, we do it by creating a new control plane with the new version; then we move the network and the disks over and delete the

old control plane. Regional control planes are upgraded in a similar fashion, but you get the redundancy of three control planes, so you always have two control planes available while we upgrade one at a time. Node upgrades can be automated as well, and we strongly recommend using auto-upgrade to ensure that your nodes are always upgraded to match your control plane version. That approach, syncing with your control plane version, not only keeps your infrastructure up to date but also mitigates the risk of the version skew we discussed earlier. By now, you

may be wondering, why wouldn't I use auto-upgrade? It is indeed the recommendation, and while the benefits are clear, you want to ensure business continuity during times of maintenance as well. Node upgrades used to follow a workflow of delete and then create. That means that when we upgraded a node pool, we deleted one of the nodes from the node pool and then brought up a new one with the new version. We changed that workflow recently with surge upgrade. With surge upgrade, we create and then we delete. In other words, we bring up a new node with the new version. When that node is

ready, we gracefully drain the old node. You can actually control how many nodes can be additionally created for the surge, and also how many existing nodes you're willing to sacrifice for the upgrade process. The combination of max surge and max unavailable also allows you to control the pace of the upgrade, and it actually allows you to accelerate node upgrades, which can be important especially if you have a large cluster. I know that you guys also have very large clusters, and that you are using surge upgrade to benefit from more efficiency. That's absolutely right, Kobi.
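How max surge and max unavailable trade capacity against speed can be sketched with a little arithmetic. This is a simplified model under stated assumptions, not GKE's implementation: it assumes that the number of nodes upgraded at the same time is the sum of the two settings, that every round takes the same time, and it ignores any platform limits. The function name and the returned keys are made up for the example.

```python
import math


def surge_upgrade_plan(node_count: int, max_surge: int, max_unavailable: int) -> dict:
    """Estimate the shape of a node pool upgrade under a simplified model."""
    if node_count < 0 or max_surge < 0 or max_unavailable < 0:
        raise ValueError("counts must be non-negative")
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be zero")

    # Assumption: surge nodes and sacrificed nodes are upgraded in parallel.
    parallelism = max_surge + max_unavailable
    return {
        "nodes_upgraded_in_parallel": parallelism,
        "rounds": math.ceil(node_count / parallelism),
        # Extra nodes exist only while the surge is in flight.
        "peak_node_count": node_count + max_surge,
        # Capacity floor while existing nodes are taken away.
        "min_available_nodes": max(node_count - max_unavailable, 0),
    }
```

Under this model, a 10-node pool with max surge 1 and max unavailable 0 needs 10 rounds and never drops below 10 nodes, while max surge 3 with max unavailable 2 finishes in 2 rounds at the cost of temporarily running on 8 nodes.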

It's like you said: we see a great increase in the continuity of the services, because pods have a place to go before their node is taken away, as opposed to having no place to go when it gets taken away. It also allows us to do our upgrades faster. We can bring more nodes online and remove nodes more quickly, which greatly reduces the upgrade time. I mean, you think about all the places around the world that companies have to be these days; Dexcom is no different. We have certainly seen great benefits in terms of the amount of time it takes to execute a node pool upgrade. Thanks.

Surge upgrade is great. And, you know, we spoke about the importance of upgrades to ensure business continuity on day two, and also about the potential disruption that upgrades may bring to a running cluster. Knowing that different customers have different version management needs, we introduced release channels in GKE, with the objective of actually simplifying the process of choosing your upgrade path and the types of upgrades that you need for your cluster. Think about it this way: some customers may want to use the latest Kubernetes version because they have dependencies on a newer API or capability,

or because they have some specific business objectives. Other customers may prefer stability over everything else, over functionality and the freshness of a version. With release channels, and there are three release channels, rapid, regular, and stable, you can choose where you want to be in terms of upgrade path. So instead of choosing a specific version, you can subscribe your cluster to the release channel that meets your needs, and GKE will take care of the upgrades. The differences between channels are the soak time and the maturity of the release. We first introduce a version to the rapid channel, where it will be

soaked. Then we promote it to regular, soak it again, and eventually promote it to stable. Are you using release channels, the rapid, regular, or stable channels, in your production environments? I can't go into specifics on that, but I can tell you that release channels have greatly simplified how we evaluate the upgrade process. We used to have to kind of pick and choose and study release notes religiously. Now we get a recommendation from Google saying this version has had this amount of soak time, and that they've looked at it and believe it's the right

version for the workloads that you're looking to upgrade. Where we have sensitive workloads, we want to make sure that they're using the stable release channel. For other workloads or other environments, maybe we use different channels that are more aggressive or that have newer features we can evaluate earlier in our development. It is important to mention that you can unsubscribe a cluster from a release channel if you need to, and you can also migrate between release channels in some cases. This all sounds great: you can automate the process, you can configure the nature of

the upgrade and control the version maturity. However, new versions may include breaking changes, and they need to be tested and validated in a pre-production environment so that you can ensure business continuity. This is why we are introducing available versions for channels. Each channel will include one default version and a set of available versions. Auto-upgrade triggers the rollout of the default version, upgrading the control plane first and then your node pools, if you are using node auto-upgrade. Before a version becomes the default, GKE will make it available in

the fleet. Patch versions will be available one week before rollout, and minor versions, which tend to be a little riskier, will be made available four weeks before. We recommend testing available versions in a pre-production environment to mitigate discontinuity risk, and also to ensure that when we do upgrade your production environment, the auto-upgrade can be completed successfully. You can use the command line at the bottom of the slide to view the default and available versions for the release channel that you are using for your cluster. And sometimes disruption simply

cannot be tolerated, right? I mean, there are peak business seasons or major company events, like a trade fair. We recommend setting up an exclusion window for such events. With an exclusion window, you can specify a range of dates in which you want us to completely refrain from any maintenance. You can also set up a maintenance window to say when you prefer maintenance to happen. That can help you not only improve the predictability of upgrades, but it can also be very handy especially now, when people work from home: you might want maintenance to line up with the working schedule, so that people are available.
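The interplay of the two settings can be modeled in a few lines. This is a sketch of the semantics as described in the talk, not of GKE's scheduler: maintenance is allowed only inside the preferred daily window and never inside an exclusion. The `Exclusion` type, the function name, and the convention that all times are UTC are assumptions made for the example.

```python
from datetime import datetime, time, timezone
from typing import Iterable, NamedTuple


class Exclusion(NamedTuple):
    """A date range during which no maintenance should happen (hypothetical type)."""
    name: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def maintenance_allowed(
    at: datetime,
    window_start: time,
    window_end: time,
    exclusions: Iterable[Exclusion],
) -> bool:
    """Return True if `at` is inside the daily window and outside every exclusion."""
    if any(ex.start <= at < ex.end for ex in exclusions):
        return False
    t = at.time()
    if window_start <= window_end:
        return window_start <= t < window_end
    # The daily window wraps past midnight, e.g. 22:00 to 04:00.
    return t >= window_start or t < window_end
```

For instance, with a 02:00 to 06:00 window and an exclusion covering a Black Friday weekend, 03:00 on an ordinary day is allowed, while 03:00 during the excluded weekend is not.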

Dexcom also has, I assume, major holidays and planned peak business hours. How do you use maintenance and exclusion windows? It's funny you mention working from home and the nature of the situation that we all find ourselves in, because in the olden days we wanted to see people in the office managing things. Now that we're working from home, that looks a little different, but we still want to have people awake and available and in their office, wherever that may be, just

in case, because you want them to be online and ready, not just waking up. Similarly, when it comes to holidays or certain events, we find there are times when we would just not want to risk an upgrade, to make sure we provide the best service to our customers. So absolutely, having that flexibility has been critical for us. Something we've already mentioned implicitly several times is having a multiple-environment setup. I can't stress enough how important and essential it is to

have a pre-production environment in place, so you can reliably deliver upgrades to your infrastructure and to your software safely and in a timely manner. You can think about it this way: you have a testing environment to test out new releases before rolling them out to production. In this case, we recommend subscribing both your testing cluster and your production cluster to the same channel, so you can use the same version in testing and production. In some cases, you may want to consider a canary or even a second development environment to test out newer releases

using the rapid channel. That will give you an opportunity to stay ahead of the curve by testing out new features or APIs, and basically improve your time to market when those versions get promoted from rapid to regular and then finally to stable. With multiple environments, you also need an efficient way to orchestrate an upgrade workflow across them, and in order to be proactive about it, you need to be able to predict the upgrade schedule for your cluster. There are several ways to do so. We have the GKE release notes, which also offer RSS feeds that you can

use. Also, if you are using release channels, each channel offers separate release notes, so you can always subscribe to the updates that are relevant to your cluster. And you can use the GKE API, specifically get-server-config, to retrieve the list of versions. All of these options are great, but they do require you to pull information on a regular basis; you need to be proactive about it. So to simplify this process even more, we are introducing GKE notifications, which will launch in beta later this quarter. The notification is based on Cloud Pub/Sub. It allows

customers to subscribe to upgrade events and get automatically informed about new available versions, and also about when an upgrade actually takes place on their cluster. In the case that I have here on the slide, we have a DevOps team using a Slack channel, and the notification is pushed to their Slack channel through the Pub/Sub subscription. This way, the team knows about and is notified ahead of time of new versions, and can then orchestrate a specific workflow: for example, upgrade the testing environment, certify the release, and make sure that the

release is ready to be safely rolled out into the production environment. Let's put everything together. We spoke about different concepts. An approach to automating and streamlining upgrades would involve a pre-production environment using the available versions that we mentioned before; a notification set up on your cluster that will inform your team when a new available version is ready for them to test and certify; and eventually a production environment subscribed to a release channel, using the default version in the channel. So, we covered a lot.
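The notification-driven workflow summarized above can be sketched as a small routing function that a Pub/Sub subscriber might call for each message. Everything about the event shape here is hypothetical: the keys `kind`, `cluster`, and `version` and the two event kinds are invented for illustration and are not the actual notification schema, which the talk does not describe.

```python
from typing import List, Mapping


def plan_actions(event: Mapping[str, str]) -> List[str]:
    """Map one upgrade notification to the follow-up steps a team might take."""
    kind = event.get("kind")
    cluster = event.get("cluster", "<unknown cluster>")
    version = event.get("version", "<unknown version>")

    if kind == "version_available":
        # A new version can be tested before it becomes the channel default.
        return [
            f"notify team: {version} is available for {cluster}",
            f"upgrade pre-production cluster to {version}",
            "run certification tests",
        ]
    if kind == "upgrade_started":
        # An upgrade is actually taking place on the cluster.
        return [f"notify team: {cluster} is upgrading to {version}"]
    return [f"ignore unrecognized event kind: {kind!r}"]
```

Keeping the routing as a pure function makes it easy to test without a Pub/Sub connection; the subscriber callback and the Slack posting would wrap it.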

And if you can believe it, we've actually got more we could cover, but we're hopeful that you've gotten a sense of some of the best practices and recommendations, drawn both from Google's experience managing GKE and from ours at Dexcom in running a highly available product with strict SLOs. There are a lot of things on the slides that we could talk about at great length, but we won't here. I really appreciate Kobi's time and getting to do this with him. Thank you

for tuning in.
