About the talk
Many organizations have disaster recovery (DR) failover plans that are poorly tested and implemented, and they are scared to test or use them in a realistic manner. This talk will show how we can use System-Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson's team at MIT, to analyze failover hazards. Observability and human understanding of safety margins and the state of a failover are critical to having a real DR capability. Chaos engineering, game days and a high level of automation provide continuously tested resilience, and confidence that systems will fail over without falling over.
Adrian Cockcroft has had a long career working at the leading edge of technology, and is fascinated by what happens next. In his role at AWS, Cockcroft is focused on the needs of cloud native and “all-in” customers, and leads the AWS open source community development program. Prior to AWS, Cockcroft started out as a developer in the UK, joined Sun Microsystems and then moved to the United States in 1993, ending up as a Distinguished Engineer. Cockcroft left Sun in 2004, was a founding member of eBay research labs, and started at Netflix in 2007. He initially directed a team working on personalization algorithms and then became cloud architect, helping teams scale and migrate to AWS. As Netflix shared its architecture publicly, Cockcroft became a regular speaker at conferences and executive summits, and he created and led the Netflix open source program. In 2014, he joined VC firm Battery Ventures, promoting new ideas around DevOps, microservices, cloud and containers, and moved into his current role at AWS in October 2016. Cockcroft holds a degree in Applied Physics from The City University, London and is a published author of four books, notably Sun Performance and Tuning (Prentice Hall, 1998).
I'm going to talk today about failing over without falling over. Many of us have seen systems that appear, on paper, to have reliability built into the plans, but what happens in practice, and what can we do about that? Let's start with being afraid. I like to ask people: do you have a backup data center? Have you ever failed your apps over to it? How would you feel about failing over to it at a moment's notice? People typically get pretty embarrassed at that point. This is what I call availability theater: if you have a backup data center and you've never really failed over to it, or you're not confident that you can fail over at a moment's notice, you've spent a lot of money on a facade of availability. If we look at some data from the Uptime Institute, historically most recorded outages have been caused by power failure, so data center issues dominated the reports, and less of the damage had to do with IT. More recently, in the most recent report, they find that IT and network problems have moved into the lead.

So why is this? Well, an interesting book about how this sort of failure arises in complex systems is Charles Perrow's Normal Accidents. Because these failures are unexpected, incomprehensible and hard to control, you're definitely going to get them; that's why they are called normal. You should regard them as a normal outcome, something you should be trying to manage for. Because of this, we try to build redundancy into our systems, so that if something fails, we can fail over. The problem is that the ability to fail over is sometimes more complex than the thing we're failing over, and the overall complexity of the entire system is now much larger. We've built ourselves a more complex system, which, as a whole, is more likely to fail. How can we do better at this?

I've been using this line for a while: you're only as strong as your weakest link, and you need dedicated teams. You need a security red team, and you need the equivalent for resilience. But I think the chain is the wrong analogy; think about this as a cable, or a rope. They have lots of strands, and if a few strands break, the cable or rope still works. But after a while you're down to the last few strands, which are just enough to hold whatever the stress is on the rope, and when they break, you have a failure. People will say the cause of the failure was the last few strands that broke, but that misses the big picture, because what actually happens is that you build resilient systems with lots of redundancy, and then you don't see the redundancy fraying and fraying until it actually breaks. This is encapsulated in a great little book I've been recommending to people for a long time, Sidney Dekker's Drift into Failure, which explains why this happens: everybody, with the best intentions, locally optimizing every step in the process, will gradually consume all of the safety margin until the system fails. So the way we have to work around this is to capture, instrument and measure the safety margins before things go wrong.

There's a good example here in the airline industry: when a plane crashes, all the airlines flying that model of airplane, and the manufacturers of whatever component was implicated, get the information, because the industry built bodies to report on and share types of failure. We don't have that kind of thing in the software industry. Something Chris Pinkham, the original EC2 engineering manager, said years ago is relevant here about observability.
Without fast detection, we are not going to be able to respond in order to manage a failure. When you couple observability to a controller, which implies some kind of model that's looking at the system, you get a control response. This is a control systems problem, and there's a load of work that's gone on over a long time in safety-critical systems; I've been reading up, trying to learn from those people. My latest favorite book is Engineering a Safer World by Nancy Leveson, a professor at MIT, and you can also go look at the STPA handbook and the STAMP workshop, held as an online conference earlier this year, where I got a lot of great ideas. There's a lot going on here, and I'm just going to give you a bit of a view of how I think we can apply it.

Here is a diagram from the book. We have a human controller, an automated controller, and the controlled process. The observability part is the sensors and displays, and then there are the controls and actuators. The other thing shown here is a model of the controlled process inside the automated controller, which is the automation logic you wrote, and the human controller has a model of the automation as well as a model of the controlled process. Think about how this can go wrong: consider the Boeing 737 MAX 8 problem. The plane is the controlled process, and the automated controller is the flight control system (MCAS). They upgraded a system that had been working, and the pilots were trained in what to do, but their mental model of this new automation no longer lined up with the actual automation. That was one of the reasons that plane became uncontrollable and crashed: the pilots didn't know what to do.

Now, if we look instead at the kind of IT systems we build, I've just relabeled this diagram. We have the human controller watching dashboards of the system, the automated controller is the control plane, probably an orchestrator or something like that, and then there's a web service handling customer requests. What STPA does is look at the hazards that could disrupt this. Let's look through some of these hazards on the sensing side. You could have missing updates, or values that are zeroed or overflowed; what kinds of things can go wrong with your telemetry? There are coordination problems, where software updates break the sensors, or break the correspondence between what you think you're looking at and what's actually there. There are problems with model mismatch, where you're not looking at the right inputs. Updates may be too rapid: everyone has seen that log file scrolling up the screen ridiculously quickly, so you can't see what's going on. Or updates may be too infrequent, say if you only get an update every hour.
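To make the control-loop idea concrete, here is a minimal sketch in code. Everything in it (the class names, the capacity numbers, the scaling rule) is my own illustration, not something from the talk's slides; the point is just that the controller acts on its *model* of the process, so a wrong model is a hazard.

```python
# Minimal Leveson-style control loop: a controller holds a model of the
# process, observes it through a sensor, and acts through an actuator.
# All names and thresholds here are illustrative.

class Process:
    """The controlled process: a web service with a request backlog."""
    def __init__(self):
        self.servers = 2
        self.backlog = 0

    def step(self, demand):
        capacity = self.servers * 10          # each server clears 10 reqs/tick
        self.backlog = max(0, self.backlog + demand - capacity)

class Controller:
    """Automated controller (control plane) with a model of the process."""
    def __init__(self, process):
        self.process = process
        self.model_capacity_per_server = 10   # model-mismatch hazard if wrong

    def sense(self):
        return self.process.backlog           # sensor: observe backlog

    def actuate(self, demand):
        backlog = self.sense()
        needed = (demand + backlog) / self.model_capacity_per_server
        if needed > self.process.servers:     # model predicts overload
            self.process.servers += 1         # actuator: scale out

process = Process()
controller = Controller(process)
for demand in [15, 25, 40, 40, 40]:           # a rising disturbance
    controller.actuate(demand)
    process.step(demand)
print(process.servers, process.backlog)
```

If the `model_capacity_per_server` constant drifts away from the real capacity (a software update changes the real number but not the model), the controller under-provisions and the backlog grows: exactly the model-mismatch hazard described above.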
Then there are the human control actions. Maybe you're just not paying attention; maybe you do the right things in the wrong order; maybe there's more than one human controller with a different idea about what to do, because another operator is trying to fix the system in a different, conflicting way at the same time. And your runbooks are probably out of date. So that's the regulation side of the diagram.

If we look at disturbances, what do we mean? A big enough disturbance will break the web service. The control plane is there to manage small disturbances: if you get a burst of requests, the control plane has some autoscaling that will absorb it by managing the data plane. But there's a limit to what it can do, and beyond that limit you're out of control. There are also disturbances the control plane knows it can't manage, and things that are simply out of scope of the controller, like attacks on the customers before they even get into the system; that's outside the scope of the control system. Then there are cases where all the control systems are working, but they cannot fix the system, say the application crashed due to corrupt state and can't be restarted. Those are the cases where the control plane can't compensate. One way of dealing with that is to provide a second system which can take over, so I worked through this diagram and added a second control plane and web service.
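A very small sketch of what that second system buys you, and what it costs. This is my own illustration, not the talk's architecture: the switch itself is one more thing the human controller now has to model.

```python
# Sketch: two redundant stacks behind a failover switch. The switch adds
# resilience but also adds complexity the operator must understand.
# Names and behavior are illustrative only.

class Stack:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

def route(request, primary, secondary):
    """Fail over to the secondary stack if the primary raises."""
    try:
        return primary.handle(request)
    except RuntimeError:
        return secondary.handle(request)

a, b = Stack("stack-a"), Stack("stack-b")
print(route("req-1", a, b))    # served by stack-a
a.healthy = False              # primary stack fails
print(route("req-2", a, b))    # falls over to stack-b
```

Note the hazard hiding in this toy: if the health signal is wrong (a sensing failure), `route` either fails over when it shouldn't or keeps sending traffic into a dead stack.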
So now we have two control planes and two web services, and we can switch between them. The problem is that there's a lot more to go wrong, and much more complexity for the human controller to try to model. So how can we deal with this? How can we make it better? One way to simplify the human's mental model is to use patterns and symmetry. Symmetrical patterns are easier to understand, which means you can get your head around what's going on. One of the things about moving from a data center to cloud is that you get consistent automation: cloud control planes have a lot of symmetry built in, because your automation is the same everywhere. Configuration as code means your systems are built the same everywhere, and cloud services are consistent across zones and regions, so there's a lot more consistency than if you're trying to keep data centers in sync yourself.

So here are the principles. If it can be the same, make it look and act identically. Be careful not to introduce things that break the symmetry just because "for now I don't really need that, so I'm going to optimize and make this one a little bit different"; that subtracts from resilience. If something genuinely is different, try not to paper over it and make it look the same, because it's going to behave and fail in a different way. And then test those assumptions: test that the things that are supposed to be the same really are the same, and that the different ones behave as expected.

One pattern here is to have three ways to succeed: replicate across the three zones in a region, or run three regions triple active. If you can afford three regions, I hope you have a really high-value application; too often a secondary region just sits there idle. It turns out the diagram gets a bit simpler with three-way replication. It's the same general idea, and if you do it right, you have real symmetry. You want the same data everywhere: all data should exist in all three zones, three copies, and the zones should be independent, so that you can keep working with one zone offline. There should be no real slowdown if you automatically fail over to running on two zones rather than three.

So let's work through this scenario and see what kinds of hazards there are and what's likely to happen in practice. Let's say this zone is offline for some reason: a hurricane, or a flood, or whatever. The system now has to detect what's going on and notify the controllers that it's happening, but it should automatically reroute the traffic, retry the requests that were in flight, and just keep going.
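The three-zone pattern can be sketched as a quorum write. This is a rough illustration under assumptions of my own (zone names, the 2-of-3 quorum rule); real replication systems are far more involved, but it shows why losing any single zone loses no data and needs no human action.

```python
# Sketch of three-way synchronous replication: every write must reach a
# quorum (2 of 3) of zones, so one zone going offline is survivable.
# Zone names and the quorum rule are illustrative.

class Zone:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}

def quorum_write(zones, key, value):
    acks = 0
    for zone in zones:
        if zone.online:
            zone.data[key] = value
            acks += 1
    if acks < 2:                                  # need 2 of 3 acks
        raise RuntimeError("write failed: no quorum")
    return acks

zones = [Zone("az-1"), Zone("az-2"), Zone("az-3")]
quorum_write(zones, "order-42", "paid")           # all three zones ack
zones[0].online = False                           # the hurricane takes out az-1
acks = quorum_write(zones, "order-43", "shipped") # still succeeds, 2 acks
print(acks)
```

With two zones offline the write raises instead of silently losing data, which is the "fail to two zones, not one" boundary the scenario above walks through.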
What's likely to happen is that the control plane will fail to clearly inform the humans that everything is taken care of, because there's going to be a huge flood of alerts as the zone goes offline, and probably errors from the sensors as it goes down; those are the hazards to each of your sensing channels. Then there's the human control action. The operators shouldn't really need to do anything, but they're confused, and they start separately reconfiguring all kinds of things. So now you've got this flood of work coming in, but the controllers are disagreeing with each other, and the data they're acting on is out of date. The zone has failed, and the flood of requests and retry storms squeezes everything else out. Those are the basic cases for a failover itself failing. The way to deal with this is to test it in game days. Because replication across zones is synchronous, the data should be consistent, and you should be able to keep going within a few seconds. If you regularly test it, it will work; but if you don't test it, the first few times you try it, it's not going to work, and it's going to take you out. You really can't guarantee anything the first few times you attempt this.

So we deal with all of those issues, and then we move on to thinking about multi-region. Multi-region is a slightly different situation, because cross-region replication is asynchronous and eventually consistent, and most people do manually initiated failover between primary and secondary regions. There's usually a visible downtime to do that, at least for the people in the region that's being failed over. In active-active, hopefully the other two regions keep running and you just fail away from the first one. This is difficult to do, and it's significantly more expensive, and it won't work if you haven't got really good operational excellence in place and a well-tested failover process. Most of the failure modes are the same; the difference in the flow is that this time the human controllers have to initiate the failover rather than it being automatic. So when people redirect the traffic, if there isn't enough capacity in the remaining regions, the incoming traffic can break everything, with a retry storm going on as well.

So how can we make all of this better? One key thing is that the flood of correlated alerts has to be reduced to actionable insights, and you need tooling to do this; there are tools coming along in this space.
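As a toy illustration of "reduce the flood to an actionable insight", here is a sketch that collapses per-host alerts into a single zone-level conclusion. The alert shape, tags and the 80% threshold are all my own invention; real correlation tools do far more than this.

```python
# Sketch: collapse a flood of per-host alerts into one actionable
# insight by grouping on a shared failure-domain tag (here, the zone).
# Alert fields and the threshold are illustrative.

from collections import Counter

alerts = [
    {"host": f"web-{i}", "zone": "az-1", "error": "timeout"} for i in range(50)
] + [{"host": "web-90", "zone": "az-2", "error": "timeout"}]

by_zone = Counter(a["zone"] for a in alerts)
zone, count = by_zone.most_common(1)[0]
if count > len(alerts) * 0.8:              # most alerts share one zone
    insight = f"probable zone-level failure in {zone} ({count} alerts)"
else:
    insight = "no single dominant failure domain"
print(insight)
```

The operator sees one line, "probable zone-level failure in az-1", instead of 51 alerts, which is the difference between a clear control action and the confused flailing described in the scenario above.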
It's difficult to do, and these tools need work to maintain them; part of your game day should be to make sure that your alert correlation is working correctly and that you really are getting down to actionable insights. The next thing is that you need a lot of headroom in your observability system. This is a problem, again, in fast-growing systems: back at Netflix, we were growing so fast that we kept breaking our monitoring tools. So you need to test that you have a lot of headroom, or a way to throttle the flood and keep your observability system going under the impact of a big incident. It may well be good to have more than one way to see the system, since the local instance of your monitoring may itself have a problem. You have to figure out how to scale this thing.

The other thing is retry storms. You want to prevent work amplification, which is usually caused by too many retries. I would cut retries to zero except at the entry point and at the exit points where you call a lower subsystem, and then reduce the timeouts to be really quite tight. You want to telescope the timeouts, so that a request whose caller has given up is no longer being processed deep in the system. If you have the same timeout all the way through the system, work carries on for requests which are basically orphaned: the people that made those requests have given up and re-made the requests into the system.

Another good practice here is to do chaos testing in an environment before introducing apps to it. Roll applications out into a "try to survive", or whatever you want to call it, region; put the app in the jungle for a little bit, and make it a badge of honor for those that have passed that test, so the teams feel proud that their system got through, and other people feel the pressure that they should also be able to get through. So the thing we want to get to here is continuous resilience: continuously tested resilience.
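The retry and timeout advice a moment ago can be sketched as code. The layer names, deadline values and the halve-the-deadline rule are illustrative assumptions of mine; the point is that retries live only at the entry point, and each layer hands its callee a tighter deadline than its own.

```python
# Sketch of retry budgets plus "telescoping" timeouts: only the entry
# point retries, and each inner layer gets a shorter deadline than its
# caller, so orphaned work fails fast instead of grinding on.
# Layer names and deadline values are illustrative.

def call_layer(name, deadline, work_time, inner=None):
    """Fail fast if the work can't finish inside this layer's deadline."""
    if work_time > deadline:
        raise TimeoutError(f"{name}: {work_time}s work > {deadline}s deadline")
    if inner:
        inner(deadline * 0.5)      # child gets a tighter deadline than ours
    return f"{name} ok"

def entry_point(request_deadline=1.0, retries=1):
    """The only place retries happen; inner layers never retry."""
    for attempt in range(retries + 1):
        try:
            return call_layer(
                "edge", request_deadline, 0.1,
                inner=lambda d: call_layer("backend", d, 0.1),
            )
        except TimeoutError:
            continue
    raise TimeoutError("request failed after retries")

print(entry_point())   # edge gets a 1.0s deadline, backend gets 0.5s
```

If every layer instead retried with the same timeout, one slow backend would multiply the work at every level, which is exactly the retry-storm amplification the talk warns about.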
Just as continuous delivery has test-driven development and canary testing, you need the same kinds of things for resilience; that way the failure mitigations you build are well tested. And I don't care whether you call it chaos engineering; some customers don't like that word, so call it continuous resilience, or production engineering, or whatever the wording is. Sometimes you just need to get it done without upsetting people. I think what we're seeing now is that everyone who has migrated to cloud has brought along their aging disaster recovery processes, and we're moving from that scary annual experience to continuously tested resilience. If you have questions, there's a paper I wrote about a year ago, with contributions from others, and much of that material ended up guiding the reliability pillar of the Well-Architected guidance. There are also some blog posts on this topic, on measuring response times and performance and things like that. Thank you.