Experienced Production Engineering Lead with a demonstrated history of working in the internet and software industries. Skilled in Go, Ruby, Distributed Systems, Virtual Machines, and Software Design. Strong engineering professional with a Bachelor of Science (Hons) focused in Computer Science from The Australian National University.
About the talk
As cloud systems become ever more complex, it becomes difficult to find and understand what impacts the performance of your system. Learn how our customer, Shopify, analyzes distributed trace data to identify performance bottlenecks and fuel meaningful improvements to benefit their users.
In this session we cover:
- How tracing gives you deep visibility into the performance of your distributed system
- How to analyze those traces (using BigQuery) to identify bottlenecks
Speakers: Francis Bogsanyi, Bryan Zimmerman
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
Subscribe to the GCP Channel → https://goo.gle/GCP
Welcome, and thanks for joining us today. We're going to be talking about analyzing distributed traces to find performance bottlenecks in your system. I'm Bryan Zimmerman, product manager with Google Cloud. And I'm Francis; I work on distributed tracing at Shopify, and I'm also a maintainer of OpenTelemetry Ruby.

First, what do we mean by performance problems? These are problems with the latency or compute utilization of your services, and they will impact your users' experience of your product and your bottom line. They are often difficult because there is no clear error or event that caused the issue. Instead, there may be an optimization opportunity that has existed in your system for a while, or a problem that has crept up or is otherwise buried. Performance problems in large distributed systems can be surprising, and causality is often missing or buried among all the telemetry signals. A couple of real examples we found: a distributed RPC was executed twice unintentionally due to a quirk in the network architecture, and a value in memcached grew in size without anyone noticing, then suddenly became popular during a flash sale, eating networking capacity. Both of these jumped out really quickly from trace data.

Today we're going to highlight a couple of key tools from the Cloud observability toolset. First is Cloud Trace. This is a distributed tracing tool that is particularly useful for latency scenarios, but it's also useful in any scenario where a downstream service is affecting upstream behavior. Then there's Cloud Profiler. This is a continuous profiling application that helps you find exactly what part of your code is responsible for an issue. Think of these as part of a larger toolset, built on the same planet-scale platform that powers Google's internal observability, so you know it's been battle-tested. In addition, it comes with zero-configuration, out-of-the-box support for Google services.

For those who aren't familiar, let's talk about what distributed tracing is. Here's an example application architecture: your users may communicate with a front end through a load balancer, which communicates in turn with multiple back ends, including databases. Today, however, your application is more likely to look like this: many small microservices working together toward a common goal. This is a great application architecture, but it becomes very difficult to troubleshoot. Distributed tracing allows you to understand exactly how a request has moved through this environment, which greatly simplifies troubleshooting. The reality is that with an increasingly large set of services, performance is impacted by every service involved in processing your request, including third-party services. Distributed tracing provides visibility into the performance of these increasingly deep distributed systems that is almost impossible to achieve any other way.

Next, let's review continuous profiling. Continuous profiling helps specifically with what happens within a particular service. The intent is to identify the method or section of code that is contributing to the issue, or that is the source of an optimization. Cloud Profiler takes profiles of your application at every function in a sampled way, which means it's perfectly safe to run in production. These profiles are laid out on a flame graph, where it's visually very easy to understand how these functions relate and where the cost of an issue lies.

When we talk to customers, the troubleshooting process typically follows these steps. You start with an issue and refine that issue as much as possible using the metrics you have. Then you try to identify the service that's responsible; this is particularly important for distributed systems.
You then refine further to the method or part of your code where the issue lies, which is extremely important for efficient troubleshooting. And finally, looking at deeper information, such as logs and debugging tools, helps you come up with a viable hypothesis for what might be causing the issue. This is, of course, required before any resolution can be attempted, and the fix then needs to be pushed and confirmed using the same metrics that identified the problem for you.

In practice, we can treat traces as contextualized logs. Typically, trace instrumentation initially focuses on the edges in a service graph, where communication happens between services. Performance problems often show up as large spans with large gaps between child spans, where we don't know what's going on. So the hypothesis part of the process involves incrementally filling in those empty spaces by adding instrumentation to our code until we gain some insight into what is different about this path.

Now let's see this in action using the Google Cloud observability suite of products. We're going to show you a series of demos on our demo application, Online Boutique, taking you from problem, to service, to method, to hypothesis. Starting with identifying and refining the problem, we're going to show how to use Cloud Monitoring to find and refine the problem you're attempting to investigate. As you can see here, we already have an incident related to latency. Looking at the policy details, you can see that we've defined this alert as 95th-percentile latency sustained for one minute. This kind of alert configuration ensures that only real problems generate alerts and wake up your on-call tech.
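The alerting logic described here, a 95th-percentile latency breach that must be sustained rather than a single spike, can be sketched in a few lines. The thresholds, window sizes, and sample values below are invented for illustration; Cloud Monitoring evaluates this server-side:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def should_alert(latency_windows, threshold_ms, pct=95):
    """Fire only if the percentile breaches the threshold in EVERY
    consecutive sub-window (i.e. sustained), not on a single spike."""
    return all(percentile(w, pct) > threshold_ms for w in latency_windows)

# Two consecutive 30-second windows of request latencies (ms):
windows = [
    [80, 90, 100, 120, 900],   # p95 breaches a 500 ms threshold
    [85, 95, 110, 130, 1200],  # breach again -> sustained -> alert
]
```

Requiring every sub-window to breach is what keeps one slow request from paging the on-call engineer.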
To understand the problem further, we're going to open up one of our pre-configured dashboards, in this case the performance dashboard. We can see a definite spike in latency. Notice how this shows up both in the aggregate percentile lines and in the distribution itself at the inflection point. At this point, we understand exactly what to investigate: gRPC server latency. Next, we'll use distributed tracing to understand the source of that latency spike.

With the problem well understood, it's time to identify the service where the issue most likely exists. To investigate the cause of this latency, we're going to use a new feature called exemplar traces. This allows us to overlay sampled traces onto the latency heatmap from within the monitoring dashboard. Here we can drill into the specific requests that generated those metric data points of interest. In this instance, it's quite obvious that most of the time is spent in the GetProduct operation. We can check another request to see if it shows a similar pattern, and indeed it does. To validate our hypothesis, let's check some 50th-percentile traces from before the spike. To investigate further, we open Cloud Trace, where we can dig deeper into the traces that exemplify this problem. Cloud Trace lets us filter on additional details, including label values, and see the trend over time. The visual representation of your entire request allows you to easily see where the problem is in this case.
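Picking exemplar traces and spotting the dominant operation is, at heart, a filter-and-aggregate over trace data. A small self-contained sketch of that idea; the trace shapes, names, and numbers are hypothetical, not Cloud Trace's API:

```python
from collections import defaultdict

def dominant_operation(traces, latency_floor_ms):
    """Among traces slower than latency_floor_ms, total the time per
    span name and return the operation carrying the most latency."""
    totals = defaultdict(float)
    for t in traces:
        if t["latency_ms"] < latency_floor_ms:
            continue  # keep only the slow exemplars
        for span in t["spans"]:
            totals[span["name"]] += span["duration_ms"]
    return max(totals, key=totals.get)

# Hypothetical sampled traces overlaid on the heatmap:
traces = [
    {"latency_ms": 950, "spans": [
        {"name": "GetProduct", "duration_ms": 800},
        {"name": "GetCart", "duration_ms": 50}]},
    {"latency_ms": 40, "spans": [
        {"name": "GetCart", "duration_ms": 30}]},
]
culprit = dominant_operation(traces, 500)  # -> "GetProduct"
```

Comparing the same aggregate against traces from before the spike is what turns "GetProduct is slow" into a testable hypothesis.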
As mentioned earlier, it seems to be in the product catalog service's GetProduct operation, which is called repeatedly and has been taking much longer than in traces from before the spike. Next, we'll show how you can use Google Cloud Profiler to identify where in the code this issue, or an optimization opportunity, is to be found, by investigating which functions in your service's code could be slowing it down. The default view of Profiler shows you a flame graph representation of your service's profile, giving a broad view of where the service is spending most of its time. Here we have selected the product catalog service, and we are specifically looking at CPU time for the service. Just by looking at the flame graph, we can see that most of the time is spent in this stack, which is the gRPC receiver for the service. Walking down the stack, we can see where the product catalog service is spending most of its time. Opening the top functions table and sorting by total time, we can confirm that 68% of the total time is spent in parseCatalog. Walking down the stack some more, we can see that this time is actually spent in the Unmarshal function. So this should be the first focus of our optimization efforts.

To form a viable hypothesis, we'll use the logging and debugging tools included with the Google Cloud observability suite of products to jump to the logs related to a particular trace. In the previous demo, we found that the cause of the problem was most likely within the product catalog service. To dig into this further, we're going to filter to that particular Kubernetes pod, which can be done at the click of a button. Immediately, via the histogram, it's obvious that something changed around the time of this issue. Let's jump to the start of the problem to see what's happening. Similar to what we saw in Profiler, this service is successfully parsing the product catalog very often and repeatedly, certainly more than it was before the issue started. This confirms our hypothesis that the problem is related to parsing of the product catalog.

To dig further, we can inspect the messages from our logs, or inspect the behavior of our code using the debugger. Cloud Debugger allows you to inspect what's happening within your running application by injecting snapshots or logpoints wherever you like in your code. We'll show this in action by going to our currency service and taking a snapshot at line 155. On the right-hand side, we can instantly see details of the local variables at the time the snapshot was taken. But what if you only want to take that snapshot in certain circumstances? Maybe you're chasing an issue that affects only a certain region or a certain user. Well, you can add conditions, which are written in the language of the code you're debugging, or expressions, which will print certain aspects of your application as you write them. Or maybe you don't want a snapshot that fires once, but something that fires on a recurring basis: a logpoint. You can inject logpoints that will run every time the code is hit, not just once, and again these can be written with any condition in the language of the code you're debugging, and with any message. The editor allows you to do this easily: you can write whatever logpoint you want, as long as it will not affect the running code and it meets certain memory and CPU footprint requirements. These features allow you to understand what's happening within your actual production application.
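A logpoint is essentially a conditional, non-pausing log statement attached at runtime. This toy sketch mimics that behavior in plain Python; Cloud Debugger injects these without any code change, so the explicit hook call, the currency example, and the region condition here are purely illustrative:

```python
def logpoint(condition, message):
    """Return a hook that records `message` each time it's hit while
    `condition(local_vars)` holds. It never pauses the program, loosely
    mimicking a debugger logpoint (vs. a one-shot snapshot)."""
    hits = []
    def hook(local_vars):
        if condition(local_vars):
            hits.append(message.format(**local_vars))
    hook.hits = hits
    return hook

# Hypothetical logpoint: fire only for one region, on every call.
lp = logpoint(lambda v: v["region"] == "eu-west",
              "convert {amount} for {region}")

def convert(amount, region):
    lp({"amount": amount, "region": region})  # the injected logpoint
    return amount * 0.9  # placeholder conversion logic

convert(10, "us-east")  # condition false: nothing logged
convert(20, "eu-west")  # condition true: one message recorded
```

The real feature also enforces the memory and CPU budget mentioned above, cancelling logpoints that would slow the service down.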
This can be a game-changer when it comes to understanding and debugging issues that only happen in production. It also makes it much easier for operators, SREs, or developers to troubleshoot issues without having to wait for long deployment cycles.

What you've just seen gives you a highly guided experience for troubleshooting performance problems using Google Cloud's observability suite of products. But sometimes the answer is not that simple: there isn't a single request, or even a set of logs, that shows you exactly where the problem is. This is where understanding the system and its requests in aggregate is extremely important. To help with this, we've recently released a feature where traces can be exported to BigQuery. Any label that you send in your span data gets written as its own column, which makes searching even easier. This allows you to use the power of BigQuery to perform custom analysis, and also allows you to retain data for much longer than 30 days. Now I'm going to hand things over to Francis, who's going to explain how this kind of aggregate analysis is done at Shopify.

I'm going to explain why we use BigQuery for aggregate trace analysis, and briefly demonstrate how we do that. Why do we need aggregate analysis? The usual approach to trace analysis is to search for interesting traces, perhaps by service and time frame, and then either incrementally refine our search until we have a small result set, or hope that the first couple of traces returned are representative of what we're looking for and start clicking randomly. Viewing a few random traces, we're hoping that anything interesting we find is representative of the larger set of traces. Using BigQuery, however, we can analyze all the traces that match our criteria, and if we still want to look at the complete context of an individual trace, the Cloud Trace viewer provides that option for in-depth analysis when something strange shows up. The first step is always to extract the interesting traces. In this example, the traces were generated by a load test, so we start with a question: we know the time interval for the load test and the IP address of its traffic, so we can find the trace IDs.
Having extracted the spans for those trace IDs, we can start our analysis and figure out what's going on. We can see around 45,000 traces in the extracted set, a significant multiple of the request rate. The BigQuery console is great for iterating on results like this: we can build up each stage of the query separately, and then go a step further, for example calculating duration distributions. In this example, the key pieces we need to know are the source, the destination, and the request.
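The source/destination/request grouping can be sketched as a join of client spans with their matching server spans, followed by a group-by. The rows and schema below are invented stand-ins for the exported span table, and the real analysis would be SQL rather than Python:

```python
from collections import defaultdict
from statistics import median

# Hypothetical span rows: client spans are recorded by the caller,
# server spans by the callee; a server span's parent_id is the span_id
# of the client span that made the call.
spans = [
    {"trace_id": "t1", "span_id": "c1", "parent_id": None, "kind": "client",
     "service": "frontend", "ms": 120},
    {"trace_id": "t1", "span_id": "s1", "parent_id": "c1", "kind": "server",
     "service": "catalog", "name": "GetProduct", "ms": 110},
    {"trace_id": "t2", "span_id": "c2", "parent_id": None, "kind": "client",
     "service": "frontend", "ms": 40},
    {"trace_id": "t2", "span_id": "s2", "parent_id": "c2", "kind": "server",
     "service": "catalog", "name": "GetProduct", "ms": 30},
]

def request_durations(spans):
    """Join client and server spans on (trace_id, parent/span id), then
    group server-side durations by (source, destination, request)."""
    clients = {(s["trace_id"], s["span_id"]): s
               for s in spans if s["kind"] == "client"}
    groups = defaultdict(list)
    for s in spans:
        if s["kind"] != "server":
            continue
        caller = clients.get((s["trace_id"], s["parent_id"]))
        if caller:
            groups[(caller["service"], s["service"], s["name"])].append(s["ms"])
    return groups

g = request_durations(spans)
durations = g[("frontend", "catalog", "GetProduct")]
mid = median(durations)
```

Once durations are grouped per request type, computing a distribution (median, p95, and so on) is a one-line aggregate per group.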
Deriving this is a little more complicated. For the set of requests we're interested in, we take the set of client spans and the set of server spans whose parent span IDs match those client span IDs, and then join them on the trace ID, with the parent span ID of the server span matching the client span ID. Grouping these gives us the requests, which we can then group by controller; the result shows which controllers the requests land on. Next, we look at the operations performed during those requests. Each operation is a span, named for the work being done, for example serialization in Ruby. The ordering information allows us to quickly identify the spans, and we can see where the time goes within the service, for example the time spent in garbage collection, which we are particularly interested in attributing to requests. I hope that's given you a taste of some of the types of analyses you can perform over trace data using BigQuery. The key elements are: getting trace data into BigQuery, which you can accomplish with the export feature; extracting interesting traces into a separate table, typically based on a time window, which helps control query costs and greatly increases responsiveness; and finally, having a tool to view the query results. At Shopify we use Mode Analytics, but there are many other tools that integrate with BigQuery.

Thank you, Francis, and thank you all for joining. Please feel free to reach out with follow-ups, and as always, visit us at cloud.google.com.