About the talk
How Equifax monitors its entire GKE fleet at scale by using uniform telemetry standards: logs, metrics, and traces. The out-of-the-box Kubernetes Engine Monitoring solution creates a path to easily identify and understand which parts of the infrastructure and services are affected, and how to take immediate action to remediate the problem.
Speakers: Ruxanda Danetiu, Vipul Mapara
Subscribe to the GCP Channel → https://goo.gle/GCP
Product: Kubernetes Engine
Event: Google Cloud Next 2020
Thank you so much for joining us. Today, we're going to be talking about Kubernetes Engine infrastructure and service-oriented monitoring with Equifax. My name is Ruxanda. I'm a product manager on Google Cloud, focused on Cloud Monitoring. And today, my guest is going to be Vipul. He is an SRE lead at Equifax. For the next 30 minutes, we're going to be focusing on three things. We're going to start by discussing SLO monitoring principles. Then we're going to hear from Vipul about how
Equifax has adopted SLOs on the digital transformation journey it is undertaking. And we'll end with Patrick, who is our engineering lead on Cloud Monitoring and is going to give us a demo. Let's start by thinking a little bit about what the principles behind SLO monitoring are. It feels like we are talking a lot about numbers and target availability. We talk a lot about how many nines we have, and how important it is to collect these numbers of nines. We talk about how important it is for us to have the best reliability out there and
try to strive extensively to reach that 100% reliability for basically everything. And that's where I want us to start: from the point of realizing that this conversation is not about numbers. It's not about getting every single microservice to its maximum possible availability, and 100% is the wrong target to begin with. And surprisingly, even though this is such an incredible engineering experience, what we are going to talk about today is not actually about the numbers, because what is more important for us than anything else is to figure out what is a
common language that gives every single person the ability to communicate. For example, you start by building requirements with your product, sales, and marketing teams, and you end up defining the goals, the business goals, for your product. You then start meeting with your engineering teams to do the development. Then you start working with your operations teams to ensure that what was built keeps running. The problem with the way that I told this story is that it feels like a sequence of steps, and you have to introduce all of these people into the
process, one by one. But what is important for us is that we actually need to have the business, the development, and the operations teams come together on defining what success looks like for the organization. And when we talk about success, it has to be a single metric that all of these teams, all of these people, get behind. That metric really needs to reflect what the customer experience is all about. We need to think about the user journey, how the end user uses our product, and we need to ensure that
we are building the right measurement of that critical user journey, and that we make this measurement understandable and available to everyone in the organization. When we talk about defining SLOs, it's the ability to create a common language across business, development, and operations about what is important for the organization and how to turn an end-user journey into a trackable metric. Because this is a common language, and as with any language, we are going to have a few new terms that we need to
get familiar with. So these are the three core terms of service-oriented monitoring. An SLI is the service level indicator. Think about this as the metric and a threshold; it is a measure of what success looks or feels like. An SLO is a service level objective, and the difference here is that you take the metric that you built your SLI on and you attach a goal to it. So you have a fraction of your interactions that must be successful. Let's say you care about measuring customer happiness with your checkout service,
you know that your finance team is really interested in making sure that people can check out and pay for the items that they care about. So we know that that is a cornerstone. We know that the service is supposed to be available, but most importantly the experience has to be good, and it could be a matter of seconds. So now we need to know how to measure that. So maybe we can get to where we have a business goal that at least 95% of all of our customers have to
be able to open the checkout cart service and see the items in their cart in less than five seconds. And that is the goal. We are talking about the end customers, in addition to the metrics, when we are thinking about how to measure. The last term here is probably one that all of you are already familiar with: an SLA is the service level agreement, and it has the SLI, the metric, and the SLO, the goal, but then it has some consequences around it. So, for example, we tell external customers that we have an availability
target, that we will be available for this amount of time, and if we don't meet those requirements then we will offer you a refund. So a lot of companies have implemented SLAs in their contracts with external customers. But what we are trying to get them to understand is that if you already have SLAs, you already have the underlying SLIs underneath them, the metrics that can transform your organization and bring your business, your development, and your operations into a single
frame, a common language. I think it's important for us to see a great example of how to change the language and adopt it in your organization, and Equifax has been an incredible example. I am extremely happy to have Vipul here today. He is an SRE lead and he's going to tell us a little bit about himself.
Thanks, Ruxanda, for having me. A little about my role: I've been with Equifax for 14 years. Equifax has been great. I started as a developer and worked in various roles, and most recently I've been working as an SRE
lead for the Data Fabric within Equifax. A little bit about Equifax: Equifax is a global data and analytics company. What I'm responsible for day to day, as an SRE, is the enterprise data lake platform. Equifax is on this digital transformation journey, and Google is our platform of choice for the transformation. So as part of that journey I'm responsible for managing that platform; the capabilities are ingest, store, process, analyze, analytics, and APIs. Most of the services that we leverage from
Google include GKE and Dataflow. Google helps improve our security posture, and security matters a great deal in what we do. And the services and the capabilities that we are serving are pretty much following the container abstraction.
Vipul, you mentioned microservices. Can you tell us about that migration?
So at the beginning, you know, we were working on-prem. So it was a little hard to get visibility into our applications, and operations was a little bit tough. And also, pre-cloud,
our engineering and operations were working in silos. But now that we have decided to embrace a digital transformation with the public cloud, we don't just lift and shift. We are really rethinking and re-architecting the applications and using the cloud tooling, and we really needed a tool that gives us good observability into what we are running. So if you don't know what you are after or what to
do, because there's too much newness about the public cloud, what I'd suggest is that you start with the four golden signals, which are part of SRE best practices, and then work your way up. I feel that's a good starting point.
You mentioned SLOs. Tell me a little bit more about the journey behind why you adopted SLOs, and what do you think are the main benefits for Equifax of using SLOs?
I mean, SLO has been a great concept and one of the best practices that I learned, and
it's a single language across our engineering, including product, marketing, and finance, aligned on a common goal. SLOs help us to prioritize our efforts. And once we have that kind of single language, working across the organization is much better. Otherwise you have a lot of meetings with analysis paralysis. You go into a discussion but you really don't have a direction, and the SLO gives every meeting a direction. So that allows us to figure out where we should drive the investment. And it is also pretty important, you know, to use the SLO as the guiding principle for when to
wake people up, and to know what really is impacting revenue, which is important for the business. So I feel that's pretty important.
So Equifax embraced SLOs as a single language. It sounds like a lot of benefits. Be honest with me: what was one of the hardest parts? Because this sounds too good to be true.
The organizational education is definitely, definitely hard. The way I think of it: when your customer comes to your restaurant, they are really expecting a meal; they are not really interested in
how you get all the ingredients so that you can prepare the meal. The customer is really interested in the meal: when you present the product, what it should look like. That is exactly the feeling they have about your product, and that is the key. So it's very easy to focus on the ingredients and lose focus on the meal. So that's the hardest part. Now, what helps us to bring the focus is the mindset, and you need the whole organization, because if you start it just as an engineering project,
it will not work, because you have not brought along your other stakeholders, who need to buy in to your SLO. So that is the other thing: you want to focus on the mindset. And frankly, at Equifax we started doing this when we brought in Google CRE, which is the Customer Reliability Engineering team.
Looking back, this has been a journey. It didn't happen overnight. Is there anything that you would have done differently, knowing what you know today?
For me personally, when I took on the SRE role, I wish I had known SRE much earlier. I did not know much about it. So I had to do a lot of research. I
had to read some books, I had to do a lot of reading about it, and running a product that works at a global scale is not easy. You really need to bring an engineering mindset to operational problems, and given my background, I thought, yes, I can do it. The other thing I would have appreciated: I would have spent more time being hands-on, so I would feel more comfortable. But once you get this concept of SLO, and working with the engineering mindset on operations, and start learning about
the concepts, I think it's good; that's what I feel.
Getting back to people at the beginning of the SLO journey: this is a very intimidating journey. Can you give them some advice? What do you think is really important to know at the beginning?
Well, I mean, it is definitely hard. I can share a little bit of what worked for us. Think about the SLO concept and the error budget concept; really learn about those concepts. Decide how you want to get
started, and socialize the SLO concepts within the organization. Bring your stakeholders along with you, and figure out where you would really try to implement it: a critical user journey to build an SLO around. Then, once you do that, I think it helps to get support, and Google CRE helped us to navigate stakeholders and how to present the view around it. But I would start small. Once you have that small pilot project, with a blueprint with some of the best practices of SLOs and everybody accountable to it, the next thing you would
get into is picking the right tools. You need the right tools to achieve scale in how you implement it. Frankly, in GCP the four golden signals are available out of the box, so you could literally take those four golden signals and build your SLOs out of them. And Google has a product, Google Cloud Monitoring, with SLO APIs that allow us to build SLOs on top of the golden signals. We started doing that for the first pilot. It gave us a single view for the pilot, and then from that blueprint we started replicating.
And we moved everything we had, all of our services, into Cloud Monitoring, and we had a single view of all of our SLOs. What you end up with is an SLO ecosystem that is already managed. And that's how I feel: start small, pick the right tools, and build your SLOs.
Thank you for sharing your personal story. That was really meaningful. I know a lot of people listening in right now are at the beginning of their SLO journey. Is there any advice that you can give somebody that is starting the SLO journey for the first time?
Definitely. I
think at Equifax SLO has been a journey for us. I would suggest you start small and really understand the SLO and error budget concepts, because once you understand the concepts, then you can decide how to get started. I would advise reaching out to stakeholders to bring them along with you, and then identifying a small pilot and figuring out the critical user journeys that go along with the pilot. Also get some help from Google CRE, who can help you navigate the SRE best practices for building an SLO. That creates the blueprint for other services.
Once you have a pilot identified, and you have defined your critical user journeys, the next topic is really about the right tools. And the beauty of GCP is that they support the four golden signals out of the box. That makes building the SLO possible as soon as possible, because you can use the four signals and start building your SLOs. Once you get into SLOs, you can use the Google APIs, which allow you to build an SLO, and then you get all your error budgets. And I think with those two
points, I feel it's a great start.
Thank you so much, Vipul, for taking the time to talk to me today and for sharing the meaningful story of SLO adoption at Equifax.
Next, let's take one example and think about SLO monitoring for your GKE fleet. What I want you to take away is that the functionality really does not depend on the underlying infrastructure that you run on. So we have integrations for many kinds of environments.
We have built a curated out-of-the-box experience for GKE as well. However, if you have a multi-cloud deployment, or are running in on-premises environments as well, if you send your metrics into the backend, we would be able to monitor those environments as well, even if they are virtual machines or pods. When we talk about scale, SLO monitoring is built on the scale of the entire Google Cloud backend for both monitoring and logging. This is the same backend that operates internally at Google as well. And it is currently used
internally at Google, but it's also used by our external customers. Everything gets built for scale from day one. There are also a lot of metrics that come out of the box and get added to the product for you to be able to create your SLOs very easily. The features that we are talking about in terms of SLO monitoring at scale, from the UI and the API, are in general availability as of today. So I am really happy to make the
functionality available to all of our customers, and to showcase the experience, Patrick, who helped build the UI and the API for SLO monitoring, is going to give us a whirlwind tour of how to do SLO monitoring for your GKE fleet. Patrick?
Hello. Today, I would like to introduce you to the service-oriented monitoring and SLO monitoring features in GCP's Cloud Monitoring product. These features are currently in general availability. First, I want to draw
your attention to the workspace selector in the upper left. A workspace is a concept that allows one to combine multiple projects into a single view. This is great for being able to view an entire fleet of services in a single pane of glass. Next, notice that we're in the Monitoring menu and the Services submenu of the Cloud Console. The first view you'll see in the Services area is the service inventory, or Services Overview page. On this page, we see a list of all of the microservices that have been detected
or identified by the user. You can see that there are a number of different types of services, depending on the technology used to develop the services. We currently support GAE, Cloud Endpoints, Istio on GKE, Anthos Service Mesh, and, what we're going to focus on today, GKE-based microservices. In order to understand this more, let's start by defining a GKE microservice. When we click Define Service, we see that a tray slides out with a list of all of the GKE entities
that are in the workspace, including deployments, replica sets, Kubernetes services, namespaces, and so on. Depending on the model that your organization uses, any of these entities could be a microservice. Now, I know that, the way that my organization works, deployments are what correspond to microservices. And I also know from discussions with my broader organization, including PM, eng, and operations, that the ad service is a service that requires particular
attention. So we search for ad service and find that there are actually two of them running in our workspace. We pick one of them, give it a name, click Submit, and with that we've created the first GKE-based microservice. Next, you can read more about best practices for monitoring services at the documentation link here, or you can move directly to creating a service-level objective. Instead, we're going to go back to the inventory and notice that our new microservice appears in the
inventory already. It includes the metadata labels that are used to define the microservice, including the project, the cluster name, CSM demo, and the deployment, ad service. It also indicates that we don't have any SLOs or SLO-based alerts yet. If we click into the dashboard for the service itself, at the top we see that the first piece of monitoring data is about SLOs. This is granted space at the top of the page because SLOs are one of the most valuable things
for monitoring services. Now, we don't yet have any SLOs, so let's fix that by creating an SLO. Again, this describes the service that we're creating the SLO for. Some service types have default availability or latency metrics, but we're going to pick our own metric. Now, there are a number of metrics that are available out of the box for all users to define SLOs. For example, here is a list of container metrics that may be useful for defining SLOs. GCLBs also have a number of metrics which
can be useful. As the developer of this app, however, I know that the thing that best reflects the performance of the ad service is the response latency metric. Based on discussions with PM and others, I know that customers are satisfied with the ad service when it responds in 30 milliseconds or less on any of the containers that are running that image. Once we describe the metric that we're going to use to evaluate the SLO, we can see a preview of how that metric is actually
performing. And we can also see that 30 milliseconds is actually a pretty aggressive target, given the behavior of the system right now. To finish defining the SLO, we set the compliance period to be a calendar week and a performance goal of 90%. What that means is that 90% of all requests to the ad service need to finish in 30 milliseconds in order to be compliant. Finally, we give it a name, we click Submit, and now we've created our first SLO. We can evaluate the SLO
based on the historical behavior of the system, so we don't have to wait for history to accumulate. We can see that right now the service level indicator, the metric, is running at about 95%, which is greater than our 90% target. We can also see that about half of the error budget is remaining for the week. If we expand the time selector out to a week, we can see that indeed, at the beginning of the week, the error budget resets and then trails off towards the end as the week progresses. Based on last
week, it looks like we will run out of error budget, so there's probably some reliability work that needs to be done for the service. Notice also that there is a view to look at alerts, SLO-based alerts, that are firing for this service. In order to create these, you can use this link here, and an alert can be created right in context. SLO-based alerts are based on the concept of a burn rate alert, which means the system can detect when the error budget is being burned at a rate that is not sustainable if the system is to remain in compliance. If we look
at other parts of the service dashboard, we can see that the metrics widget is interesting. It contains a lot of the out-of-the-box metrics that are available to users, arranged in a number of bundles. We're currently seeing the container metrics corresponding to CPU. We can also look at, for example, pod-level metrics on network usage. We select that and the charts update; we can see individual pods and the network usage on those. Alternatively, we could look at a node metric such as CPU to see what kind of
capacity our cluster has. If we scroll down further, we can see the pods, the Kubernetes pods, that make up the service. Here we see we have five pods running the ad service right now. All of them are ready and running, so they're healthy. For each of them, resource utilization is described, and under the snowman menu we see a link to the Kubernetes Engine page. If we switch to the Kubernetes part of the Cloud Console, this is where we can see more information about the pod, including its events and its logs, and also take actions on the pod such as
restarting it. The last part of the service dashboard is the logs widget. This widget gives a brief overview of the logs that are being emitted by the microservice. If, however, we want to look at the logs in more detail, we can go to the Logs Viewer. The interesting thing about pivoting to the Logs Viewer is that the context, or the filter state, for the microservice is preserved. You can see here that the filter by default goes to the cluster that we've been monitoring and the
ad service. Even the timestamp filter is brought across to the Logs Viewer to preserve context. The Logs Viewer shows a histogram of log volume. It also facets it by severity. We can filter to the error severity and very quickly get to the stack traces corresponding to our microservice. Now let's return to the idea of SLO-based alerts. If we go to the overview, we can see that there are three SLO-based alerts that are currently
firing. If we filter to those services, we see that one of them is the default GKE namespace. Let's look at what a service-level dashboard looks like when there are incidents firing. The thing you'll notice that is different is that now there's an alert timeline at the top. The timeline indicates when incidents are firing by the red bars on the timeline. If you hover on these bars, you see a summary of the corresponding alert, along with a link to get to other details.
The alert timeline can also be used as a time selector, and when you select a time there, that also updates the time for all the charts and for all the log entries. You should also notice that this default microservice has two SLOs defined, so you can have more than one SLO defined for a single service. You can also have more than one alert defined for a single SLO. That completes a tour of the UI for service-oriented monitoring and SLOs. You should also know that, if you're interested in managing large numbers of services or SLOs, there is an API.
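As a rough illustration of managing SLOs through that API rather than the UI, the sketch below builds a request body shaped like a Cloud Monitoring `ServiceLevelObjective` for the SLO from the demo: 90% of requests within 30 milliseconds over a calendar week. The talk does not name the latency metric used, so the filter string, the helper name `build_latency_slo`, and the assumption that the metric is recorded in milliseconds are all hypothetical. The code only constructs and prints the body; it does not call the API.

```python
import json

# Hypothetical filter: the talk does not say which metric type or labels
# the demo used, so this custom metric name is an assumption.
LATENCY_FILTER = (
    'metric.type="custom.googleapis.com/adservice/response_latencies" '
    'resource.type="k8s_container"'
)


def build_latency_slo(display_name, goal, threshold_ms, latency_filter):
    """Return a dict shaped like a request-based latency SLO.

    goal is the fraction of requests that must be "good" (0.90 for 90%).
    threshold_ms assumes the distribution metric is recorded in milliseconds.
    """
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be a fraction strictly between 0 and 1")
    return {
        "displayName": display_name,
        "goal": goal,
        # Compliance period from the demo: one calendar week.
        "calendarPeriod": "WEEK",
        "serviceLevelIndicator": {
            "requestBased": {
                # Requests whose latency falls inside the range count as good.
                "distributionCut": {
                    "distributionFilter": latency_filter,
                    "range": {"min": 0, "max": threshold_ms},
                }
            }
        },
    }


slo_body = build_latency_slo(
    "90% of ad service requests under 30 ms", 0.90, 30, LATENCY_FILTER
)
print(json.dumps(slo_body, indent=2))
```

Keeping the definition as plain data like this is one way to replicate a pilot SLO across many services: the same helper can be called once per service with a different filter and threshold.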
That API is also available and allows access to all of these features. It has been my pleasure to introduce you to service-oriented monitoring concepts in Cloud Monitoring today. Thank you.
Thank you so much, Patrick, for the demo. I really appreciate the work that went into building the product and putting it together. I want to share a special thanks to the Equifax team, which has been one of our users of SLO monitoring. I also want to thank the Customer Reliability Engineering team at Google, which is using the SLO monitoring product as a best practice for adopting SLOs.
I'm really happy that we are finally in general availability. And if you want to learn more, join the other operations sessions and learn about all of the work that we've made available. Thank you so much for your time.
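The error budget figures quoted in the demo can be checked with a little arithmetic. With a 90% goal, 10% of requests are allowed to be bad; an SLI of about 95% means 5% were bad, so roughly half of the budget is left. The sketch below works through that calculation and a simple burn rate check. The function names and the alert threshold of 2.0 are illustrative assumptions, not values from the talk or from Cloud Monitoring.

```python
def error_budget_remaining(sli, goal):
    """Fraction of the error budget still unspent.

    sli and goal are fractions of good requests, e.g. 0.95 and 0.90.
    """
    allowed_bad = 1.0 - goal   # budget: share of requests allowed to fail
    actual_bad = 1.0 - sli     # share of requests that actually failed
    return 1.0 - actual_bad / allowed_bad


def burn_rate(bad_fraction_in_window, goal):
    """Speed of budget consumption over a lookback window.

    1.0 means bad requests arrive exactly at the rate the goal allows;
    a sustained rate above 1.0 exhausts the budget before the period ends.
    """
    return bad_fraction_in_window / (1.0 - goal)


def is_burning_too_fast(bad_fraction_in_window, goal, threshold=2.0):
    """Hypothetical alert condition: burn rate above a chosen threshold."""
    return burn_rate(bad_fraction_in_window, goal) > threshold


# Numbers from the demo: SLI of about 95% against a 90% goal.
remaining = error_budget_remaining(sli=0.95, goal=0.90)
print(f"error budget remaining: {remaining:.0%}")

# Illustrative window in which 30% of requests missed the latency threshold.
print(f"burn rate: {burn_rate(0.30, 0.90):.1f}")
print(f"alert: {is_burning_too_fast(0.30, 0.90)}")
```

A burn rate alert of this kind fires on how quickly the budget is being spent rather than on the raw SLI value, which is why it can warn well before the weekly budget is actually exhausted.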