James is committed to helping engineering teams become more deliberate in how they build software through developing strong learning cultures, principled engineering practices, and holistic architectural thinking. He has worked with web technologies since 2003, and loves barbecue.
About the talk
RailsConf 2019 - Building for Gracious Failure by James Thompson
Everything fails at some level, in some way, some of the time. How we deal with those failures can ruin our day, or help us learn and grow. Together we will explore some tools and techniques for dealing with failure in our software graciously. Together we'll gain insights on how to get visibility into failures, how to assess severity, how to prioritize action, and hear a few stories on some unusual failure scenarios and how they were dealt with.
All right. Good morning, everyone. It's the last day of RailsConf, and you came to my talk, so thank you. I am James Thompson. I'm a principal software engineer for Mavenlink. We build project management software, primarily for professional services companies. We are hiring in both Salt Lake City and in San Francisco, so if you're looking for work, come and talk to me; we'd love to get you on our team. Now, this is a talk that is roughly similar to what I did in Los Angeles at RubyConf, but I have added a little bit of material. So if you did happen to come to my talk on the same subject at RubyConf, you'll be getting a little bit extra, and I've tried to change the focus up a little bit. But what we're going to talk about today is failure: how to deal with failure, how we can cope with the failures that happen in our systems. And so I want to start with a very simple question: how many of y'all, when writing software, have ever written software that failed? Yeah. It fails in all kinds of ways. Sometimes it's because of a hardware issue. Sometimes it's because the code we wrote isn't quite perfect, or isn't even close to perfect, or is just garbage. Sometimes our systems fail for things completely outside our control. And I'm going to be talking about a few different ways to deal with particularly those kinds of failures: the kinds that are difficult to foresee, but that can be mitigated. And so the first thing that we have to come to terms with is that failure happens. We have to deal with the fact that everything fails some of the time; it's unavoidable. We need to plan for it, and we need to have strategies in place to help us mitigate it.
And we need to think about these things ahead of time as much as possible. We can't foresee the future, but we can plan for reasonable outcomes, reasonable ways in which things may fail (and we'll get into some maybe unreasonable expectations with some of the stories I have to share). For a lot of the systems that we build, we can plan for the kinds of failures that happen in them. Now, not everything I'm going to talk about today is going to be immediately applicable to the project that you're working on. If you're working on a monolith, some of the specific stories that I have come out of a microservice ecosystem, and so there's not going to be a perfect one-to-one correspondence there. But I'm going to try and present some ideas that should have general application regardless of the type of environment you're working on, regardless of the programming languages you're working in, regardless of the frameworks that you're using in your day-to-day development. And I want to start with what I think is the most basic and fundamental practice, and what I hope is an uncontested truth, and that is that we can't fix what we can't see. Now, I'm going to tell a story about this, but I hope nobody here thinks they have such perfect omniscience that they can fix something they have no awareness of. As we think about our systems, we need to be looking for ways to gain visibility. The first step in being able to deal with failure, and being able to cope with the ways that our systems are going to fail, is to gain visibility into those failures, and there are lots of different ways that we can do that. This visibility aids in managing our failures, and it does this by giving us a window into the many facets that help determine when and how and to what degree our systems are failing. Beyond error reporting and instrumentation, there are systems for metrics capture that can go a long way to provide richer context, to help you understand why and how and to what extent your systems are failing. And to help illustrate this, I'm going to
share a story. Now, we know that low visibility is dangerous in many contexts. Whether you're sailing, whether you're in a plane, or driving, having low visibility is a hazard. In software it is similarly hazardous, although not to quite the same life-threatening degree in most cases. Now, I was working for a company a little while back, in a microservice environment, on a system that was written in Go. Not my favorite language in the world, but I was able to get up to speed on it, and we had very extensive logging coming out of this service. We thought we knew a lot about what was going on. We had alerts that were set up on the logs. We were able to monitor and see what was going on, or what we thought was going on, by keeping track of the logs coming out of the system. But then we started to notice something strange in our staging environment. We started to notice that the processes we thought should be completing work weren't actually finishing. While we saw the data coming into the system, and we could see in the logs that the system was saying it was processing that data, we were not seeing the results coming out the other end that we expected to see. And this revealed to us that while we thought we had adequate visibility, clearly something was missing; clearly we were missing some part of the picture here. Now, in connection with this service, I had started working on rolling out some additional tooling that was already in other parts of our ecosystem: specifically Bugsnag for error reporting, and SignalFx for metrics tracking. And by rolling out these two solutions, rolling out
Bugsnag to give us a more context-aware way to see the errors in our system, and also SignalFx to track some very simple metrics, specifically the jobs that started, the jobs that succeeded, and the jobs that failed, rolling out those simple changes gave us an immense amount of visibility that we did not have previously. Fundamentally, what we were able to do was to go from what you see on the left here to what's on the right. Now, how many of y'all just love staring at log files trying to figure out what's going wrong? I hate them. I actually find that log files are damn near useless when you're actually trying to figure out what's going on in a system. Almost every log file is incredibly noisy and has a very low signal-to-noise ratio compared to what you're actually trying to get out of it. And so things like Bugsnag (and there are lots of tools like this; we have some on the vendor floor here) give you a much better picture of what's going on in your environment, when you start to see how many errors are happening and what environments they're happening in. Having that kind of visibility changes the way that we interact with our application. It gives
us more information to be able to decide what we need to work on, and when. But that's not the only thing that you need. In large systems, or even moderately sized systems, just knowing that there are several thousand or tens of thousands of errors in a certain system may not actually tell you that much. And in the case of this system, even though we had what seemed like high error rates, we didn't actually know if that was normal or abnormal. And the reason we didn't know that is that one of our data sources was a credit bureau, and credit bureau data is god-awful. It is just a horrible dumpster fire in terms of formatting and consistency. We have numbers that can be either a number, or a blank string, or the string "NA", or occasionally something completely different that's not documented at all. And so that kind of inconsistency means that we don't know how much we should be failing. We know that we should be failing occasionally, because the data we're dealing with is just garbage, but we don't know how much. And so that's where we
brought the metrics tooling to bear. This is where we used SignalFx in particular, to get a graph like this. This graph scared the crap out of us. The blue line is how many jobs were starting, the orange line is how many jobs were failing, and there is no green line, because no jobs were succeeding. This gave us a really quick window to know that we had messed something up badly. And of course this was happening in our staging environment; thankfully we had not yet rolled out this series of changes to production. And so we knew that in our staging environment we had something that we desperately needed to fix. But if we'd just been looking at Bugsnag, or even the logs, which were not telling us exactly how many of our requests were failing (which turned out to be 100%), if we had not had this additional context, we would not have realized how severe the problem was, and we might have ended up chasing down other bugs and other errors that we thought were more important, but that weren't actually our key problem. And so this is an area where having this visibility gives you greater context, to be able to figure out not just what is failing but
why it's failing, and why what's failing matters: why the system that you're dealing with is important, and how big the scope of your failures is. Now, there's some additional tooling that I've actually come to love recently, from a company called LogRocket, that has the ability to show you, down to the user interaction, where errors are cropping up in your system. They can tie user interactions back to Bugsnag or Sentry or Airbrake or other error reporting services, and you can then dive into an error and actually go and see what the user did that triggered this error, whether it's client-side versus server-side. That additional level of visibility is incredibly powerful in being able to figure out: does this error matter? And so visibility constitutes the table stakes when it comes to dealing with errors. When it comes to figuring out how we deal with failure in a reasonable way, we have to start by raising the visibility of our errors, and giving ourselves as much context as we can about those errors, so that we can actually deal with them in a sane way. We need to pick tools that give us visibility not only into the health of our systems, not just the raw error details, but that also give us the broader context of exactly how bad this error is, and that gets into things like metrics. Ensure that you're collecting the context that's around your errors as much as possible. This will help you to know how you should prioritize your efforts, and again, raising visibility alone will give you a huge advantage when things go sideways, as they will in your apps. Do these kinds of things and you will greatly improve
the path that you have available to you when it comes to resolving errors, and even more so, you'll be better equipped to know which errors actually matter to your customers. The more context you have, the better the information you have to act on. And that leads into the next thing I want to talk about, and that is that we need to fix what's valuable. How many of y'all have ever worked with compiled languages? All right. Now, how many of you are familiar with the mantra that we should treat warnings as errors? All right, a few of y'all. The reality is that warnings do not all matter. And even if you have an error reporting system, and you have bugs that are legitimately coming up in your system, they do not all matter. They do not all require an immediate response. And so that's why it's important to think about what is actually valuable in our system. What is it about our system that gives it value to our customers, to the consumers, to the collaborators that work with it? Let's make sure that we're actually prioritizing our effort based on the value that we're either trying to create or the value we're trying to restore to those people. This is one of those areas where things like outdated dependencies, or security vulnerabilities, or countless other issues tend to get lumped into the category of technical debt. A lot of it just isn't real, because it doesn't have any value impact: it's not depriving anyone of value, it's not causing anyone to stay up at night. And we need to be better about thinking through whether or not something that we encounter in our system that looks like an error is actually an error that's worth fixing. And this is where having that visibility matters: being able to figure out what is going on.
How big is the impact, and is it affecting people in a way that's actually depriving them of the usefulness of what we've built? And so, if you have product and customer service teams within your organization, before you start working on some error that you've encountered, or that has been reported to you, cross-check with them. Is there another mitigation strategy? Could customer service help your users use the system in a way that's not going to bring about this error case? Are there other ways that we can address the problems in our system that do not require us to invest some of the most expensive resources that most organizations have, their engineering teams, in fixing every little fire when it comes up? Instead, can we let some of these things burn for just a little while, so that we can focus on the things that have real and demonstrable value? So again, the more we can focus on value, the better we will do at satisfying the people who actually need our software systems to work. They want to use our systems for a reason, and we need to preserve that reason, not preserve our own
egos, in terms of what we're choosing to focus our attention on. Now I want to get into some of the actual stories that deal with some unusual error conditions. The first one connects to a principle that I think can be applied in some cases, called "return what we can." The next story I want to share is about a different microservice at the same company from the one I described previously, and a fairly unique issue that came up when dealing with it. Now, in this situation, we had one generation of a microservice that we were replacing with a new generation. Fundamentally, they did the same thing. They stored individual data points, tracked over time, about what it means to be a business in our ecosystem. They tracked things like the business's name, when it was founded, when it was incorporated, its estimated annual revenue, whether or not they accept credit cards: all kinds of individual data points. And we tracked them over time so that we could try and develop a profile of business health. And in the process of deploying this new service, we had to come up with a migration strategy, because we needed to bring over several million data points from the old system. We didn't want to lose that historical context. So we designed a migration strategy that allowed us to bring over that data, to deal with the fact that the database structure of the new service was fundamentally different from the database structure of the old service, and to preserve that history. Now, that migration went well. We were able to bring over all the data. All of the cross-checks came back fine. We were able to migrate all of the
service's collaborators over, and they were all able to get up and running using the new generation of the service. Nothing went terribly wrong when it came to the actual migration from the old service to the new. What went wrong was when one of these collaborators added a new capability that needed data from our service; specifically, they needed some historical data from our service. And in doing this, they discovered that some of the data we had migrated from the previous generation of the service was garbage. It was corrupt. It had never been valid, and in the process of bringing it over to the new system, it was still definitely not valid. But because of how we translated that data from the database, we got hard application-level errors. Part of the reason for this was that we'd encoded these previous data values into a YAML-serialized column. We did this because it was easy, and because we didn't know exactly what the shape of this data structure was going to have to be long-term. So we used the easiest thing that we had available to us, and that was a YAML-serialized column. We eventually migrated it to a native JSON column in Postgres, but for this initial rollout we had a problem, because when we started asking for this data, we were getting YAML parser errors. Whenever we tried to deserialize these garbage values out of the database, the YAML was not well formed, and we got a Psych error. And so, when we encountered this, our service returned a 500. It was an unreasonable expectation for our service that, when it had data it couldn't handle, it would error. The problem was, when our service errored, its collaborator also then errored,
which then caused another upstream collaborator to error, which then took down an entire section of our site, one that was responsible for actually making money. And so we had a cascading failure that resulted from this very low-level issue with database values not being the way we expected them to be. Now, we had multiple problems that needed to be fixed in this context, but the easiest one to fix was to simply rescue that Psych parsing error. And that's what we did. We looked at the data, we looked at how it was corrupted, and we realized that the reality of this data is that, because the data is corrupt, there is absolutely nothing special about it. It's essentially an empty value, and because it's an empty value, we can simply return nil anytime we encounter this corrupt data. And so that's what we did. We started returning nil anytime we ran into this parsing error. And yeah, we might be missing some data occasionally, just because the data is malformed, but the reality is, if we can't parse it, it is fundamentally worthless to us. So returning nil was a valid option for us, and it's the one that we pursued.
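In Ruby terms, the fix amounted to rescuing the YAML parser's exception and substituting nil. Here's a minimal sketch of the idea; the method name is my own invention, not the actual service's code:

```ruby
require "yaml"

# Deserialize a YAML-serialized column value, treating corrupt
# data as an empty value instead of raising a hard error.
def safe_deserialize(raw)
  return nil if raw.nil? || raw.empty?
  YAML.safe_load(raw)
rescue Psych::SyntaxError
  # The stored YAML was never valid, so it carries no usable
  # information; nil is a sane blank value to return instead.
  nil
end
```

With this in place, a corrupt value like `"founded: on: garbage"` deserializes to nil instead of taking the whole request down with a 500.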
And this is an example of where we were able to return what we could. We had lots of different data points, but if we encountered just one that was corrupted, that we couldn't understand, we could return nil in that case. And in many cases, especially in a microservice environment, and in most application contexts that I've worked in, returning something is better than returning nothing, and it's absolutely better than returning a hard error in most cases. Very rarely does all of our data actually have to be complete in order to be useful and valuable. But we are accustomed to thinking about it as if it's an all-or-nothing proposition. We want to think about how to have less dependency between parts of our systems, not more. And so, in those kinds of situations, a great way to start decoupling and loosening the connections between our systems is to start thinking about how little we can give in a response and still be useful. That's why I think there's a lot to be said for returning what you can, and coming up with sane blank values to return for the things that you can't.
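Put together, "return what we can" might look like this: each data point deserializes independently, so one corrupt value becomes a sane blank instead of sinking the whole response. This is an illustrative sketch, with invented field names, not the service's actual code:

```ruby
require "yaml"

# Build a response from many independently stored data points,
# substituting nil for any value that cannot be deserialized,
# rather than failing the whole request over one bad field.
def build_profile(raw_fields)
  raw_fields.transform_values do |raw|
    YAML.safe_load(raw)
  rescue Psych::SyntaxError
    nil # one corrupt data point should not sink the rest
  end
end

build_profile(
  "name"           => "--- Acme Co\n",
  "annual_revenue" => "--- 1200000\n",
  "founded_on"     => "founded: on: garbage" # corrupt value
)
# => { "name" => "Acme Co", "annual_revenue" => 1200000, "founded_on" => nil }
```

The collaborator still gets the two good fields; only the unrecoverable one comes back blank.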
Another principle I want to talk about is that we also need to be accepting. Just like we need to be somewhat cautious in terms of what we return, and try to maintain a working state whenever we have people asking us for information, we also need to be very generous in terms of what we're willing to accept. This is the notion that acceptance is often better than total rejection. Now, in the same service, we had many collaborators, almost half a dozen or so, and all of them had only a small piece of the total picture of what a business was. We had some data coming from credit reporting agencies. We had some data that was being supplied by users through parts of our own interface. And we had other data coming from other sources, and none of them gave us a complete picture of what a business was, as far as we were concerned. And so we needed to make sure that we were able to accept as much or as little data as each of these collaborators was willing and able to give us. We designed this system in such a way that it does not require you to send it a complete version of the business profile object. It'll take one field, if that's what you've got. It'll take all the fields, if you have them. And so, in doing that, we were able to make the system very permissive in terms of what you can send it. As long as you can send it some data, it will figure out what it can save and what it can't. In addition to that, we also made it resilient in such a way that if you sent it five fields that were great and one field that was total garbage, it would accept what it could, and it would tell you what
it couldn't accept. And that was a very important detail as well. Instead of saying, "well, because you sent us one data field out of this set that we don't understand, we're going to return a hard error to you," we designed the system, because we made the decision that accepting some of this data was better than accepting none of it, so that we would be forgiving in terms of what was sent to us. We would accept everything that we could understand, and we would warn the collaborator about what we couldn't understand. And so this ended up giving us a system that was much more resilient to the variety of sources that we were receiving data from, sources that we had no direct control over, and that were going to have varying degrees of quality in terms of what they were going to send us. And so we want to make sure that, as we're building services, we're being forgiving when it comes to the collaborators with those services. We want to think about the data that has to go together, and the data that we can accept independently. If we do that, our services will become more resilient to failure, and they will fail much less often.
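The acceptance side can be sketched the same way: keep every field we recognize, and report back the ones we don't, rather than rejecting the whole request. The field list and response shape here are invented for illustration:

```ruby
# Fields this service knows how to store; anything else is
# reported back to the collaborator rather than persisted.
KNOWN_FIELDS = %w[name founded_on annual_revenue accepts_credit_cards].freeze

# Accept as much or as little of a business profile as the
# collaborator can send: save what we understand, and warn
# about what we don't, instead of returning a hard error.
def accept_update(params)
  accepted, rejected = params.partition { |field, _| KNOWN_FIELDS.include?(field) }

  {
    saved:    accepted.to_h,
    warnings: rejected.map { |field, _| "unknown field: #{field}" }
  }
end

accept_update("name" => "Acme Co", "mascot" => "Octocat")
# => { saved: { "name" => "Acme Co" }, warnings: ["unknown field: mascot"] }
```

The collaborator's good field is saved, and the warning tells them what was ignored, so nobody gets a 422 or a 500 over one unrecognized value.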
Now, the final principle I want to talk about is this idea of trust, and that we should trust carefully. The reason that trust is important here is that in any software system we're going to have dependencies, and those dependencies imply a level of trust in whatever it is, whether it's another service that we're collaborating with, or a dependency, a Ruby gem, that we bring into our project. We are trusting that thing to act reasonably and to not break the rest of our system. But that trust is very rarely deserved. When we depend on others, their failures can very easily and very quickly become our own. And this gets back to that story that I originally told, with regards to the cascading failures within our system. We'd essentially built a distributed monolith, not a collection of microservices, where a failure in one service caused a failure in another service that cascaded all the way up and caused an outage in a portion of our site for all of our users. There was too much trust baked into that system. And this was definitely too much trust, because of course we had multiple teams working very loosely together, which is the idea behind microservices: it gives you this ability for teams to move somewhat independently of each other. But all of these teams had been very trusting about the nature of how our services were going to respond, about how reliable they were, and that led to a cascading failure that led to an outage for actual customers. And so we need to be mindful of this whenever we use somebody else's services, whether inside our organization or outside. When we use Google or Facebook
or Auth0 or anyone else to do authentication for us, their failures can very easily become ours. If we're using S3 or some other storage backend to host files for our system, their failures can very quickly become ours. And so we need to be careful who we trust, and we need to make sure we have mitigation strategies for when they fail, not if they fail. Because eventually AWS will have another big outage. Eventually Google will have another big outage. Eventually our own systems will have another big outage. And we need to be thinking about how we can build our systems to trust a little bit less, and to respond well in the face of these kinds of failures. How can we design them so that they don't fall over when one portion of what they are doing stops working the way we expect it to? This is also one of the reasons why I think most teams are nowhere near ready to adopt microservices: because they are not yet ready to deal with the fact that you're trading out method calls for
distributed systems latency, that we are dealing with systems that are inherently complex, and in most cases, most teams do not yet have the skill or the knowledge to actually be able to deal with that inherent complexity. And this, again, is a problem that can come up in any kind of collaborative environment. No matter who you're using, no matter what the scale or scope is, there is an opportunity for other people's failures to become yours. Now, this is not an argument to avoid dependencies, although sometimes that is absolutely the right decision, and you should not depend on someone else for a critical service or feature. Rather, the point is to be careful about who we trust, and why, and to what degree. We want to make sure that the systems that we collaborate with, that our systems depend on, are as reliable as they can be, but also that we make plans for when they break. So trust carefully. Expect others to become unavailable. Respond graciously and gracefully in those cases. Don't return a 500, or any other kind of error, if you can avoid it. Try to preserve the user experience in some meaningful way. Think about the customers, the other developers, whoever it is that is using and relying on your services. Think about them as you are building your systems and thinking about the ways that they might fail. The big takeaway is that we need to expect failure. That's the key idea at the heart of chaos engineering. This is a key concept that I think we all need to grapple with more deliberately and more intentionally, even at small scales. We can't
avoid things breaking, so we need to plan for how and when they're going to break. And the first step is to make sure that we have good visibility into our systems: not just log files, not just error reporting services, but meaningful application-level metrics that allow us to get a better picture, a more complete picture, of the scope and the scale of our failures when they happen. This is how we can be better at preserving value and restoring value for our customers when our systems break. Thank you. You can find me online. I'm not going to do a Q&A, but if anyone has any questions, I'll be hanging out down here afterwards; come and talk to me. And if you're looking for work, also come and talk to me. Again, thank you for coming out, and there's the link to my slides as well.