About the talk
Help to improve developer effectiveness and reduce toil with tools such as Cloud Logging, Cloud Trace, and Cloud Debugger. See how our customer, Wix, integrates Google Cloud's suite of products into their IDE workflow to offer a seamless experience.
Speakers: Meredith Hassett, Bryan Zimmerman
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
Subscribe to the GCP Channel → https://goo.gle/GCP
Welcome, and thanks for joining us. Today we're going to talk about how Google Cloud's observability suite of products is creating a better developer experience for its customers. I'm Bryan Zimmerman, product manager with Google Cloud, joined by Meredith Hassett, developer advocate at Corvid by Wix. As I mentioned, we'll talk about specific tools that help developers be more successful day to day, but these exist within a larger ecosystem. This tool set is built on the same planet-scale core platform that powers Google's internal observability, so you know it's been battle-tested. In addition, it comes with built-in zero-day observability for key Google services. This all sits under the umbrella of our logging and monitoring products, but it includes much more. When talking about how to help developers day to day, we first have to ask: what do developers need from their observability tool set? We talked to our customers, and there are three key themes. First, I need to diagnose issues quickly. When there's an issue with your application or site, there isn't time to go hunting for logs or hunting for metrics; you need to be able to find and diagnose the problem with ease. Second, some problems are harder to find than others, and with the large amount of data available about your application and the infrastructure it runs on, especially in the cloud, having a tool set that helps you effectively find the needle in that haystack is really important. Third, not all issues can be reproduced in the development environment, and even when they can be, it's time-consuming, and sometimes you need answers quickly. So being able to debug in production is often critical to reducing
mean time to resolution and to being effective in the cloud. Today we'll walk through each of these areas with specific examples of how they can help developers be more effective. First of all, let's talk about diagnosing issues quickly. Metrics are a key part of understanding problems and what's happening with your system; traces and logs have the context and details to help you diagnose the cause of those issues. But they're not always linked together, so going from a metric to the request and log that generated a particular metric data point is difficult, and being able to do so is important for diagnosing issues quickly. What's the solution? A feature we call example traces, otherwise known as trace exemplars. In the screenshot on the left, you can see a heat map distribution of latency with two important additions. First, there are aggregate percentile lines that help us understand how the distribution is trending. Second, there are example distributed traces that have been sampled. These represent complete requests, and you can actually drill into a request without leaving the context of your monitoring tool, which you'll see in a
moment. Let's see this in action. The demo we're about to show uses our example application, called Online Boutique. In this scenario, users of the application, who normally experience great performance, have started to notice extremely degraded latency. The developer in the video will show how to use the Google Cloud operations tool suite, and the example traces I mentioned earlier, to quickly diagnose that issue, using Cloud Monitoring and Cloud Trace to quickly understand where we should direct our attention with regard to a particular issue. In this case, we have an alert related to 99th-percentile latency, which has spiked quite significantly. Drilling into our performance dashboard, we can see that there was a significant increase in latency around the time of the alert. This seems to possibly correlate with error rate, but not exactly. The increase can be seen both in the distribution itself and in the aggregate percentile lines and how they have increased over time. Let's drill in further. Typically, going further from here is actually quite difficult: the specific request that led to that data point is not always easily available. However, with Google Cloud operations we can overlay example traces. Now, without switching context, we can easily drill into exactly what was happening at the time the issue was observed. Looking at this trace, this is a call to the cart, which makes a call to get product, and then makes further calls to get product, which is interesting behavior. Out of the
7-second duration, 6.9 seconds were related to get product. Let's see if this correlates with other types of requests. Here is another, similar product call, which has a similar profile of child calls to the product catalog service. Here's a get cart operation, which is different; however, when we look further, it has the same repeated get product calls to the product catalog service that we saw earlier. As we look further, we see similar behavior, with product catalog service operations representing the majority of the latency. Now, to be sure of that, let's compare with traces from before the issue. Here is a similar get product call, which makes calls to the recommendation service and, again, a few calls to get product, which is what we saw before. However, if you look at the total latency, only 31 milliseconds were spent within the service. This compares to the 20 seconds in some of the other examples we were investigating. There definitely seems to be a delta in the duration of the get product calls that is representative of this latency. At this point, I have a viable hypothesis for what caused the issue, and I got there without ever leaving my monitoring tool. From here, I would look at Logging, or perhaps Error Reporting, to try to understand what was actually causing the issue. But what we've shown here is that with these tools, you can quickly
understand where you should be focusing your attention and where the problem may be located. The previous example focused on speed, but that's not always what you need. When it comes to finding a needle in a haystack, it's about the breadth and thoroughness of your tool set. In the next example, we're going to show how you can use the Google Cloud operations suite to look for a very difficult-to-find cause of an issue. This focuses on Logging and Trace, and we'll follow that up with Profiler. To demonstrate finding a needle in a haystack, we're going to look through Cloud Monitoring, Cloud Trace, and Cloud Logging. Starting with monitoring, we see that we have an alert related to latency. Looking at our dashboards, and specifically the performance dashboard, we see that there was a spike in errors on the front end that occurred fairly recently. To troubleshoot, we're going to drill into things with Cloud Logging. Within the logs viewer, I can very easily filter for the log lines that contain errors. I can also drill down by Kubernetes resource type to see where the errors came from. In this case, they're almost all from the front end. So what's happening in these instances? Drilling into the details of these logs, I can see what's happening. In this case, I see "failed to get recommendations". I expect that the other errors are very similar: again, "failed to get product recommendations". I can add this field to the summary line to see how prevalent this error is, and we can verify that right here. Now, to understand what the recommendation service is doing, let's open Cloud Trace.
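The workflow just described, filter for errors, then drill down by resource to see where they came from, can be sketched in plain code. This is only an illustration of the idea, not the Cloud Logging API; the log entries and field names below are invented:

```python
from collections import Counter

# Hypothetical structured log entries, loosely modeled on what the logs
# viewer displays; the field names here are assumptions for illustration.
logs = [
    {"severity": "ERROR", "pod": "frontend", "message": "failed to get recommendations"},
    {"severity": "INFO",  "pod": "frontend", "message": "request served"},
    {"severity": "ERROR", "pod": "frontend", "message": "failed to get product recommendations"},
    {"severity": "ERROR", "pod": "checkout", "message": "payment timeout"},
]

# Step 1: filter for the log lines that contain errors.
errors = [entry for entry in logs if entry["severity"] == "ERROR"]

# Step 2: drill down by resource to see where the errors came from.
by_pod = Counter(entry["pod"] for entry in errors)
print(by_pod.most_common())  # the front end dominates, so we focus there
```

In the real logs viewer these two steps are a filter expression and a resource-type drill-down rather than code, but the shape of the investigation is the same.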
In this case, I'm going to filter for spans, otherwise known as service calls, where my front end is calling my recommendation service. We can see the latency spiked heavily around the time that I received the alert and started receiving those errors. Let's see what's happening. What we can see here is that after calling list recommendations, the system is making multiple calls to get product from the product catalog service. That's interesting. In fact, in this 19-second trace starting with a get cart request, almost all of the latency is spent within those get product calls. That's definitely not what we want to see, and it suggests that's where the issue is occurring. Let's return to Logging to understand what's happening within that product catalog service. Again, it's very easy to find this out by drilling into the resource type in the field explorer, where we filter for the pod named product catalog service. What we can see is that right at the time of the error, there is a massive spike in logs from this service. Narrowing down to this time interval, we can see that almost all of these logs are "successfully parsed product catalog". At this point, it's safe to assume that this is where I can focus my attention. I'm going to open up my IDE and look at the parsing of the product catalog to find out why it is repeatedly parsing successfully. This is a good example of how to use the operations tool set, including Logging, Trace, Monitoring, and more, to quickly narrow down a hard-to-find problem. Sometimes a
problem can't be easily found in the logs or the traces; it has to do more with the functioning of your running code, or perhaps with cost optimization. This is where Profiler is extremely useful. Next, we'll see how you can use Profiler to solve these needle-in-the-haystack type problems. Cloud Profiler can help developers understand how their services are spending compute and memory resources. We will explore how Profiler can be used to identify optimization opportunities and help you run your services more efficiently. It supports four languages: Java, Go, Node.js, and Python. For this walkthrough, we will use the product catalog service and a CPU time profile. The default view for Profiler is the flame graph shown here. It helps us understand which parts of the code are using the most resources. We can see the stack where the parse catalog function is using most of the CPU time, at about 72%. We can see that parse catalog is calling the JSON unmarshal function, so this has to be the first focus of our investigation. Let's open the top functions table and sort the functions list by self time. We can see that 8.3% of CPU time is spent in the function stateInString, which is not evident from the default view shown in the flame graph. We can focus on this function by clicking its name below the flame graph, using the focus filter mode. This mode is useful for identifying functions that are called from multiple locations in your code. It also helps us understand the various calling paths into this function. This is very useful for understanding the resource usage patterns of library functions, which may be called from different locations in your code.
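The distinction drawn above, cumulative time versus self time, can be reproduced with any profiler. Here is a minimal sketch using Python's standard-library cProfile rather than Cloud Profiler, with an invented workload, showing how sorting by self time surfaces the low-level helpers that do the actual work, much like stateInString in the demo:

```python
import cProfile
import io
import json
import pstats

# A toy workload, analogous to the product catalog service repeatedly
# parsing JSON: most CPU time lands inside the json library's helpers,
# not in our top-level function.
def parse_catalog():
    doc = json.dumps({"products": [{"id": i, "name": "x" * 50} for i in range(200)]})
    for _ in range(300):
        json.loads(doc)

profiler = cProfile.Profile()
profiler.enable()
parse_catalog()
profiler.disable()

# Sorting by "tottime" (self time) rather than "cumulative" surfaces the
# low-level decoding helpers that a caller-oriented view can hide.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(5)
print(out.getvalue())
```

In the cumulative view, parse_catalog appears to consume nearly everything; the self-time sort instead attributes the cost to the JSON decoding internals, which is where an optimization effort would actually need to look.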
Removing the focus filter resets the flame graph to the default view. Profiler includes a "select percentage of profiles" option; for example, top 10% indicates that only profiles collected during the top 10% of CPU consumption are presented for analysis. You can also use the compare-to function to compare profiles from two different time periods. Here we are comparing profiles from the top 10% with all profiles. The frames that are shaded orange have higher resource consumption when compared with all profiles; we can see that there is some increase in consumption in the client handshake function stack as well. Resetting the compare-to loads the default flame graph. So these are some of the features of Profiler that developers can use to understand how their services are spending resources and where they should start their optimization efforts. Our final scenario covers how to debug code when it counts. Have you ever had a scenario where you could not reproduce an issue in your development
environment? Then this is for you. If you've ever had a scenario where the one line of logging that you needed was not there when you needed it, this is for you. In this next demo, we're going to see how Cloud Debugger can help you debug in production when you need it most. To start this journey, let's open Error Reporting. Error Reporting automatically surfaces the errors it has found in the logs for easy discoverability. Within the error group, we can see the trend of when that error has occurred, where in the code that error has occurred, and easily link to the logs that generated the error. Exploring these logs, we can see quite easily that this error is exclusively caused by the currency service. We can inspect the log for further details, including resource context. However, this doesn't actually show us what's happening and causing this particular problem. For that, we're going to open Cloud Debugger. Debugger allows us to see what is happening in our actual running production code. In the logs, we saw that the error was happening at line 55 of the currency service. Clicking there, we can take a snapshot of what's happening in the code at that exact position and inspect the details of the variables. This is done without restarting the service and without pausing it: it's completely safe in production. The cause isn't obvious here, but we can see from the code that if results.units is not greater than 0, it will throw that error. So let's see what results.units is. In this case, it's set to NaN (not a number), so that's obviously not going to work. Let's see how this was
actually set. In this case, the value we set there was also not a valid number; as such, it generates the error that we saw in the logs. Next, we would update the code to resolve this issue, but you can see how, with production debugging, this was easy to find without any impact to my system. And that concludes our three scenarios. Next, we're going to hear from Meredith Hassett from Wix. Developer velocity and operations tooling really help developers speed up their process, enabling them to add additional lines of logging as well as error handling, which is crucial for a large platform: it helps developers debug issues that come up in production and make sure their users are getting the best experience. Let's take a look at how to find and debug issues on a site. We'll look at a vacation booking site where users can provide their preferred check-in and check-out dates to get updated quote information about the requested holiday. We're able to do this using Corvid by Wix to get dynamic information from third parties, displayed on our
screen. Let's take a look at how this looks in the code in our Wix editor. We now see the site structure and code IDE panes. Corvid is our web development platform for websites that opens up APIs, as well as UI functionality, for you as a developer. Here is our request quote function; we can see it's actually coming from the back end. So if we go ahead and take a look at that back-end file, we're able to find the request quote function and see that it's calling an API to return booking information about a property based on the user's input. Once that quote is returned, we may need some error handling to determine whether there are any issues with the quote information, and then check to see if the quote has any error code. If there is an error code, we may want to return an error status to be able to debug better, and to pass through properties from the quote to the error. Once our error handling and logging is in place, we can publish to make our site changes live, with our production site ready to go. Let's go ahead and enable logging back on our Wix site dashboard. When we go to Settings, Production Tools, Site Monitoring, we're able to easily connect to the Google Cloud operations logging tools. When we connect, we just sign in with our Google single sign-on account, and we're able to get to the dashboard. Once we're in the Google Cloud dashboard, we're able to look at the logs for our site. So let's take a look at what our user flow might look like and how we can do some error handling. Back in my Wix editor, I'm going to go to View Site, and on my site I'm going to go through a user flow of trying to book a vacation. Let's take a look at the properties that are available and select one that looks like a nice vacation home. Once I'm on the vacation property's page, I'm able to play around with the quote information and see how changing my requested dates will change my quote. We can go through trying to book this stay as well.
Now I'll try to place my booking. If any users run into issues, they can pass that information back along to the agency to help figure out what's going on; as the developer, you want to make sure that your site is healthy. If we take a look back at the logs, we're able to see whether there's any relevant information and find any errors. I can filter down to my errors, see that there was an issue with the quote retrieval, and determine what went wrong by taking a look at that information and the other events logged around it. Having integrated tools like Google Cloud operations for logging on websites, like those built on Corvid by Wix, enables us as developers to do our jobs more efficiently, spend less time looking for errors, and resolve any issues that our users may be having when they come to our sites. Thanks, Meredith. And there you have it: we've demonstrated how the Google Cloud operations suite of products can help make developers more effective and efficient day to day. Feel free to reach out for more information or, as always, visit us at cloud.google.com.