About the talk
This talk provides an overview of the categorization of outages that happened at Uber in the past few years, based on root cause types. We'll start with some background information, including definitions, the incident management framework, and existing preventive techniques, aka best practices. This is followed by details and rationale around individual categories, sub-categories, and their relative distribution. Then we'll deep dive into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques to assist detection and simulation of some of the common root causes. Finally, we'll discuss the propagation of lessons learned in terms of policy and process changes based on these insights.
Software engineer, system administrator. Automator of everything. Core Chef maintainer. Member of the Chef Governing Board. An ex-ThoughtWorker, ex-Googler, and ex-PagerDutonian, I explore and extend agile development and project management methodologies, large-scale system automation, and data mining in the domain of web operations. Currently a senior production engineer on the maps production engineering team at Uber. Lifelong open-source enthusiast and tinkerer of everything related to computing, from hardware to software.
Hello, everyone. This is Angie from Uber. I work in the group and I'm here to talk about concerning resiliency with the data that incident. I don't know what learning around analyzing and learning from around 2010 and over the last few years and how we have used those data to improve our overall tech stack reliability and with an emphasis on the call Taylor Hospital. Before we begin, let me give you some background on how this talk is laid out: it is divided into three sections. We'll start
with some basics and context, a little talk about what do you call a person has? And what kind of tools to use to get logos? And finally, the main body of the talk, with the rationale for this categorization. And then we'll jump, in the last part, to a couple of key learnings from all of those incident analyses. Let me start with a few background details on what constitutes an incident at Uber. An incident is a partial or complete degradation of a service writer.
We use our product, any incidents can be caused by, or most of them are caused by, our tech stack, but they can also be unplanned you to be no legal reason. I'm kind of students are generally. 11 to 30%. Incidents are generally discovered by our alerts, but that is not always true. Sometime, in case of blind incident, the most violent incident will call letter reporting crank up date. I don't know what is happening. Once that happens, generally all the on-call folks get into motion and try to fix it as soon as possible.
Most of the time, during the incident, we do not know the cause, so we mitigate it first and then circle back and do the root cause analysis: what is the end-user impact, and so forth. After this, every group, every team, or individual, all at different levels, reviews these incidents on a periodic basis. It can be with you to schedule depending upon how critical which level of organization remove intuition. After all of this has happened.
Process all kind of see, no, people and cultural studies are the different personas depending upon what is happening and what? All the books are involved and different folks. Okay. So now that you have some idea of what constitutes an incident at Uber, and how it is managed and handled at the company, let me start with the basic properties of individual incidents, which is 7000 + 3 years, is that the disc categorization for us to learn from this incident? So, that all the learnings and lessons can be easily
integrated back to the system as a feedback cycle, right? So it can be categorized in several different ways. Among the three properties that every individual incident. Carry. The first one is the domain of The Incident Management. All sorts of cool deploying configuration reply, think, like a basset management located on the card because the domain is the internet back. So we don't service our current interest including storage systems management. All evolution of their own and we get
incidents on those also. Things like in external incident things like a ice related stuff. We have lot of a payment and strengthen other indications that we mind has been on our side, but just to give some idea of the Year 2525 and the rest is in all other types of like a tattoo food, or you can only see if you do want to compare to in. How bad? You know how to build a Singapore X is basic property destruction. So next will be jumping to individual categories. And each of those categories. I will talk to you insolent
and do the slide sticks and that their titles are only providing the under the gas indicator. What where is the Ender block? During the shows the time City's property for broad categories to be going to school deployment, or could change into types of a chicken in coronavirus Tumblr? And then I'll make something like tropical shift is dropping. It can also be a dynamic configuration changes in both. And also, where is, which is kind of a dark place where you have deployed to court and you have Gatorade by
configuration change in Behavior incident, one of those services. The competition was Annabelle weeks are lost in it. Started listening either it was. And that's how we first lost. Have the crippled. And then something to note here: the incident itself is denoted with all represented, and things like incremental deploys and rollbacks are very relevant for detecting and mitigating these. So if you have to pick one thing for reliability, it is probably rollback and change management. Deployment management is the single most important thing.
The second category is capacity. Incidents can also happen from my media dusko causes. But there are three or four big types of capacity issues. Had not considered that we have not considered under, you know, we have not planned and found him by surprise when they can be changed. Our stack is composed of thousands of microservices, and individual microservices call each other and there's sometimes. What's the time slot in the graph? Represent the downfall by covid
having more and more to keep up with this up, with its capacity. An example of a capacity incident is one that had connection handling issues: not opening and closing connections properly, and that leads to file descriptor exhaustion and that kind of bad behavior. I'm doing. I'm singing a bigger because, you know, a few dozen broadly categorized in networking, storage, and compute. The common issue we have seen is that are you in English. I actually won service and it has multiple
instances of that particular service inside of an individual amounts of different and different latest approval. For each of those clusters. That traffic was a gallon. This is David again. I'm using too much are too. Why you so stupid to denote the condition during the outage in a different type of quantify. In compute, we have sometimes seen noisy neighbors. Which can you are looking to be sippy largest cargo Network, depending upon how you handle things are. Not always reveal
something. Networking specialist are not being used. So when they can impact the behavior, you know, inside the host from a different phone was acting, I picking up local resources leading to some aberrations also selling for all of the containers. The latest high-profile arms are from another container from the same host. And you can see an oscillation link to that. It lists the individual grass types that are very similar to our experienced it, so you can get a
group of students in different functional groups and doing. So allows the use of the best possibility to incorporate black bag with learning to individual group and systems because Understanding of teaching environment in all different kinds of the best. Give me the bestest to streamlining the feedback to settings for my behavior. With that said, I think I would like to take the last few minutes to talk about couple of hour. I said I want to share a couple of skill and the first and foremost.
This is a superset of learning or on it in Versailles on the on the well-being of our dramatic. A process of creating incident and particularly folks involved in the incident, in the healthcare sector or fire in kind of those are the institutional evidences and they are the ones who know most of what happened. So they're in there and courage to admit. Cap City and they stay in power is off at most critical in 2017 by a bunch of other things. And sometimes when you are working with a complex system in a, given a
short span of time, not everything could be reduced in a sick time. And so nobody can be held accountable for those things were relaxed and as a result, they buy 2018. We got a lot more data and that block feedback into the system and over a lot more than 2017 from this process without that kind of control ship. It is not possible to even get to this kind of data. The second part is the avaricious engine, any part of it, which is basically the same time in parallel
table, Landing from alcohol, take to learning some Incident Management and added back into the public management cycle on boarding, right, shadow shadow on, call Transportation. All of this, how you do in the afternoon. So, I look at the numbers that we keep adding a Sprint planning. Please celebrate. And those are critical part of your soul and making them as a complete part of the development lifecycle methodology within this domain. Like they can see how this thing lines up for it as of today. It should not be treated as a thought. This was all I wanted to share but
thank you so much for doing and I have not been able to covered everything that I could because it's time we supposed to the time cuz I have added a bunch of different. This definitely engineering directions to Roundup newsletter is the one I would strongly recommend A Certain Justice process. These are some of the cultural highly recommended for anyone wants to know more. I have made a white paper from a power distribution company. And electrical grid are very
similar to what is our learning. Understanding about the incident space. Thank you. I didn't get a chance to meet you when I was at Uber, but I was there from 2016 to 2018. I was in a story or Saint Cloud infrastructure and then moved on to Uber platform. So it's really amazing to now see how much incidents have grown, and now seeing the categories, it was really cool to see that it's like throughput, utility, and then those golden signals. It really is key to be
able to start learning from incidents. So I'm going to go first with one of the questions that we got asked from the public, but then I have one or two questions I want to throw at you. So the first one that we had: folks were asking, what is the first step in getting started in building this resiliency? Is there one thing we should go after? Like, what would that be? I would say: set up a process for capturing it, and set up some understanding of what is normal,
you want to call the bar, depending upon whether you have a high-throughput business or you're just starting off someday. You might have to build it to someone understanding, this is what you going to track because everybody in the product I can get into emotion. That is, like, the best advice that we really don't hear out in the industry right now. It's amazing to think about what it is that really does say our systems are normal, and then kind of normalize it across the entire stack, or at least for those like someone stuck to your son's
5 informant, external Isabel, high-impact incident and school Used to Know level. And so 11 is the impact and the City Museum in st. Louis County, Regional and Global. So having yours are in one of the most serious, but as long as you have a common understanding, that is a starting point. Do you think. I got, I do you touch the van? You have to take a lot of learning from like, what you're doing. I'll call the incident. So, you go through and I know you mentioned onboarding. So I remember like education, we had calzones near me being part of Education. Like, do you think that
now, that learning of, like, what is the data that every single engineer needs to keep in mind for their job, is that something that y'all have now implemented using the process of people in terms of food, places in terms of joint know everything. So, we have at multiple steps of our induction program and reliability. Bishop and that involves not only introducing and I'm sending a relationship going to leave. So I know that was one of the things that we were trying when I was there, like, here's the runbook, so you actually know what to do. Cuz it's you, you have
to know where that. It is exactly what you asked, you all to stay in touch and post Communication in your outward-facing top warm. So, one of the questions that I kind of wanted to ask is, like, how is it that y'all are taking the learnings from these incidents and also now doing the capacity planning? Like, I know right now it's that time before off-peak season kind of kicks off for y'all. Yeah, actually, it's a pretty interesting time
and we are on our way to significant change for many of the services. So probably, you know, Legion level of 10. Chili and we doing weekly or bi-weekly basis that we were just doing complete failover exercises, like, every single week and it was like, one of my friends was involved every single week. There was only like someone was busy, but it was that load testing the traffic during. Do you know everything we're going to do is a 8 p.m. 4 a.m. Like. I think somebody was leading up
that project. I was like, this is not the channel for communications integration cycle for us for all that. I'm looking for emergency preparedness and every week. I buy second-hand to my next question. The last question before we close out to Sun from the rest of the questions for Q&A channel. The question is: is there tooling available that helps you automatically categorize Metro store in an outage? Yes and no. We as a community, we know different in different aspects. What are the different tools? For example, of the things? He's very popular
even with all of these individual everywhere. So the context is very important. And what is the other problem you have to be able to explain what it is? Find me arsenal of MLB's techniques of detection and he's all not even maybe this but a diamond setup available to talk sense into individual relevant and signals is the ark. I think that is a yes. That's what your beautiful ending to that. Question is like we do have to do this as a community and it's about like sharing those best practices. So thank you so much for your
talk. I think you gave so many nice little resources for folks to follow up. I know more questions are currently getting posted, so you get a chance now to answer those questions in the Q&A channel straight away, and we have to thank you for being such a great speaker and for your contribution to chaos.