Josh is a full-time programmer, part-time human, whose interests include weird programming, physics, math, and trying to make software reliably be better. When he’s not writing code or equations, he’s probably biking somewhere or watching something on HBO.
About the talk
What is the sound of a zombie screaming?
Race conditions are a problem that crops up everywhere. This talk will go over what a race condition is, and what it takes for a system to be vulnerable to them. Then we’ll walk through four stories of race conditions in production, including one that we named the “Screaming Zombies” bug.
You’ll leave this talk with a greater appreciation for how to build and analyze concurrent systems, and several fun stories for how things can go amusingly wrong.
And if you were wondering about the question at the top, the answer is: silence.
What is the sound of a zombie screaming? What does that have to do with race conditions? And who is this person sitting inside your computer, telling you all about race conditions, zombies, and screaming? My name is Josh. I've been developing software with Ruby for about five years. I currently work for a company that you may have heard of, and have almost certainly used, called Braintree; we deal with payment processing, usually credit cards. Before that, I worked at an Internet of Things platform, where we tried to connect things to the internet. We would do all sorts of things, but we mostly focused on printers. In my past life and lives I've used both Ruby and Java fairly extensively. This is also my first time speaking at a RubyConf. I've attended many RubyConfs before, I've always been eager to try speaking, and this is exactly how I imagined it would go.

Let's talk about incrementing. This is, you might think, one of the simpler operations a computer can do, and in a sense it is. But it's not really just one thing: incrementing is two operations at the computer level. First you load the value that x initially has and figure out the new value, x plus one, and then you assign that new value to the variable x — to whatever location in memory x is being stored in. The thing that makes this important is the idea of an operation being atomic. x = x + 1 is not atomic, because it happens in multiple steps. In fact, most things aren't atomic: most database operations, most list operations and other data structure manipulations. Most things are not atomic.

So let's look at why that matters. Here we have what's called a sequence diagram. The boxes at the top represent things; the lines that fall down from the middle of each box represent that thing's forward motion in time, so time starts at the top and flows towards the bottom; and the arrows that flow between them are messages being sent or passed between the things. What we're seeing here is two running processes, each labeled "code," and each running the x = x + 1 we saw two slides ago. The way we'll read this is to start at the top and let our eyes drift towards the bottom. At the top, we see that x is assigned a value of 0. Then the code on the left springs into existence and asks x, "Hey, what is your value?" x replies, "Zero." The code on the left does some sort of extremely complicated math, figures out that 0 plus 1 is in fact 1, and then assigns 1 to x. The code on the right comes in and does basically exactly the same thing, except that x starts out with a value of 1, so the code on the right assigns 2 to it.

Here's why we care so much about this being atomic — or, in this case, not being atomic: nothing guarantees that those arrows will get clustered together. If the sequence of operations happens in a different order, like so, then we get the wrong value for x. What happens on the left is that the code says, "Hey x, what is your value?", learns that the value is 0, and then assigns 1 — but the code on the right queried x before the code on the left got around to assigning it. That means the code on the right tries to update x with stale data. This is the secret sauce of a race condition: multiple processes interacting with the same shared data. In this case, the two processes are the two codes on the left and the right, and the shared data is the one byte that is storing the value of x. Those multiple processes rely on things happening in a certain order, but the things don't necessarily happen in that order — and when things happen out of order, that is a race condition, and the associated bugs.

Now a quick puzzle. Imagine we start out with x = 0, but instead of the codes on the left and the right each incrementing x once, they each increment x ten times. The question is: at the end of all this, when we print the value of x, what are the possible outcomes? This is the moment where I'd recommend pausing for thirty seconds or so, before I move on to the next slide, in case you want to figure this out on your own — because I'm about to tell you the answer. The answer is: some value will get printed, and that value will be a number within a certain range. The range has 20 as its maximum, which kind of makes sense: there aren't enough increments to cause x to go above 20. The smallest possible value is 2, and I'll leave it as an exercise to the viewer to figure out what sequence diagram could cause x to wind up as 2 when it gets printed.

And now for story time. I'm going to tell you four stories, each of which is a moment during my career when I came across a race condition and it caused some kind of bug — maybe a bug in tests, or in some monitoring tools — but in each of these cases there was a race condition. I'll tell you how we found the race condition, what kind of impact it had, and what we did to fix it.
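Before the stories, here's a small Ruby sketch of the non-atomic increment above. This is my own illustration, not code from the talk: each simulated process splits x = x + 1 into its read step and its write step, and we interleave those steps by hand to reproduce the lost update.

```ruby
# Simulate the two steps of `x = x + 1` as separately schedulable
# operations, so we can interleave them in a "racy" order on purpose.
x = 0

make_increment = lambda do
  value = nil
  read  = -> { value = x }          # step 1: load the current value
  write = -> { x = value + 1 }      # step 2: store value + 1
  [read, write]
end

left_read,  left_write  = make_increment.call
right_read, right_write = make_increment.call

# The "happy" ordering: each increment runs to completion before the next.
[left_read, left_write, right_read, right_write].each(&:call)
puts x  # 2 -- both increments took effect

# The race: both processes read before either writes.
x = 0
left_read.call
right_read.call   # also reads 0 -- stale by the time it's used
left_write.call
right_write.call  # overwrites with 0 + 1
puts x  # 1 -- one increment was lost
```

In a real program the scheduler does this interleaving for you, nondeterministically; that is the whole problem.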
So story number one is called Murphy's Law. Smurf was a tool that we had at Braintree — that we have at Braintree, still — that just gathers up data and sends it to one of our banking partners. This is the kind of thing that, in a payment processing company, has to happen all the time. The mechanism by which it does that: various parts of our system write data to a bucket in S3, and Smurf reads the data from that same bucket, does some processing in order to format the data properly, and then sends it off. Of course, we have code running in different environments, so in order to properly separate data, we have one bucket for prod, one for sandbox, and one for test. The one in prod is appropriately access-restricted, the one in sandbox has sandbox data, and the one in test gets cleared every so often and repopulated during integration test runs with data the integration tests know about.

Let's drill into test a little bit. Here's a typical integration test run: the test starts by seeding some data into storage, then tells Smurf to do its thing, and then makes an assertion that Smurf did the right thing with the data that was seeded. Because one of the things we care about is that Smurf handles changes in data correctly, we do this in two steps, so we have step-one data and step-two data.

Here's the bug. We started to see failures — sporadically, but especially once we started actively developing on Smurf — failures where we expected the step-two results but got the step-one results. So what happened? Well, imagine two runs of the tests running at the same time. We have T1 over on the left; let's say that's our CI setup, running tests against the main branch. And T2 over on the right; let's say that's me, running the tests on my local machine. The tests on the left are just wrapping up: the test has seeded the step-two data, and Smurf reads that and then does the right thing. Right after that, my tests start, so they seed the step-one data, et cetera. Well, again, none of these operations are atomic. So what if, instead of happening in this order, T2 seeded the bucket with step-one data before the Smurf run over on the left read its data? Then we would see something like this. Notice that the lines going down from the T1 box, both of the Smurf boxes, and the T2 box are all still doing exactly the same things they were doing before. But now the tests over on the left don't see the right data, because the Smurf on the left read the wrong data. This is exactly what causes this kind of failure: we expected the step-two data, but we got the step-one data.

The problem here is what I kind of alluded to before: we have one bucket for test. Well, clearly that's not enough. If we have multiple test runs running at the same time, what we need is different buckets. That way we won't have different runs of the tests stepping on each other. This turned out to be kind of complicated to do, but once we had it all set up, it ended up fixing the problem completely.
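The separate-the-data remedy can be sketched like this. The class and names here are hypothetical, and a real setup would create per-run S3 buckets or prefixes rather than use an in-memory hash, but the idea is the same: each test run gets its own namespace, so concurrent runs can't step on each other's seeds.

```ruby
require "securerandom"

# Hypothetical sketch: each test run gets its own storage prefix, so two
# concurrent runs read and write disjoint data.
class TestRunStorage
  attr_reader :prefix

  def initialize(base_bucket: "smurf-test")
    # A unique prefix per run plays the role of a per-run bucket.
    @prefix = "#{base_bucket}/#{SecureRandom.uuid}"
    @objects = {}
  end

  def write(key, data)
    @objects["#{prefix}/#{key}"] = data
  end

  def read(key)
    @objects["#{prefix}/#{key}"]
  end
end

run_a = TestRunStorage.new  # e.g. CI running against main
run_b = TestRunStorage.new  # e.g. my local machine

run_a.write("seed", "step 1 data")
run_b.write("seed", "step 2 data")

# Each run sees only its own seed; neither can step on the other.
run_a.read("seed")  # => "step 1 data"
run_b.read("seed")  # => "step 2 data"
```

The race condition goes away because there is no longer any shared data for the two runs to disagree about.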
Story number two: Out of the Void. As a payment processing company, we deal with transactions — you know, that moment where a customer shows up on a merchant's website, or in person, and says, "Here, have some money," and the merchant says, "Hurrah! Here, have some goods or services." The life cycle of a transaction can be summed up like this. This is obviously very oversimplified, and probably slightly wrong, but I'll walk through how a transaction moves from being a glimmer in the hopes and dreams of a merchant and a customer to being money that has actually moved from one bank account to another.

We start out with a transaction in the pending state. Then we check to make sure that the credit card is valid — that's the authorizing step. Once that's happened, the transaction is in the authorized state; that's usually when the transaction is approved, when your credit card is accepted. Then, at some regular cadence, we do a thing where we move all of those transactions into the settling, and then settled, states. That is basically the step where the money moves from one bank account to the other. As far as your credit card statement is concerned, that's when a transaction moves from the pending section up into the posted section — or down, depending on how your statement is laid out.

So let's drill in on that settlement process. It runs as a batch process, because we don't want to iterate over each transaction individually — there are just enough transactions that that would be stressful. So, in a batch process, we load up all of our transactions that meet some criteria, and we ask each one, "Are you authorized?" The transaction says, "Yep," and then the settler says, "Okay, let's start settling you now" — which usually means mark it settled, or mark it settling, and then do the next thing. This is not the only way a transaction can go, though. If, after a transaction is authorized but before it's settled, a customer comes in and wants to cancel it for whatever reason — or maybe the merchant wants to cancel it — then we void it instead, and the transaction moves into the voided state. The way the sequence diagram for that looks is this: a customer comes in and says, "Void." The transaction says, "Okay, I'm void now; you can move forward." Then the settler comes in and says, "Hey, transaction, are you authorized?" The transaction says, "No, actually, I've been voided; please don't settle me," and the settler says, "Okay, I'll move on to other things."

Here's the bug. Every so often — very rarely, but more often than never — a transaction that was voided got settled. This is obviously confusing for any customer who saw it: "I voided you; why did you settle?" So what happened? Well, suppose the settler asks the transaction, "Are you authorized?", and then, after the settler gets confirmation that the transaction is authorized, the customer comes in to void it. At that point the settler doesn't know any better, so it tries to settle the transaction anyway. In other words, from the perspectives of both the customer who voided the transaction and the settler, each is doing the right thing: the customer thinks their transaction has been voided successfully, but the settler thinks the transaction should be settled.

This is a thorny problem to fix. The fundamental thing we need to ensure is that we never transition between the settling and voided states. That's easy enough in one direction, because voiding a transaction happens in the context of a web request, so we can always block it: we can check the transaction and say, "You're settling? Okay, then you can't be voided." But going the other way is a lot trickier, because settlement happens in a batch job, and because it's a batch job, things can happen after we load the data. We don't want to lock the whole transactions table — that would freeze the system, which is a bad idea — and we don't want to check each transaction right before putting it into the settling state, because that would incur a lot of requests against the database.

What we ended up doing — and I just think this is so diabolically clever — was to throw a timeout in there. What the settler does now is, instead of immediately settling a transaction, it marks the transaction as "you're about to become settled." So in this scenario, the customer comes in and tries to void the transaction, the settler marks the transaction as "I'm going to try to settle you," which the transaction accepts happily — but then, when the settler comes back and asks, "Have you been voided? Basically, should I still settle you?", the transaction says, "Yes, I've been voided, and no, you should not settle me." The reason the 60-second timeout here is so important is that any web request in our system has to complete within 60 seconds. So, as long as the settler waits out that 60-second timeout, any void requests that were initiated before we started the settlement process are guaranteed to have completed. And what happens if a void request arrives in the other order — if we marked the transaction as settling before the void request came in? Well, at that point the transaction knows it's about to be settled, and so it says, "No, I'm sorry, you can't void; you'll have to find another remedy." Maybe the customer isn't the happiest, because a transaction they wanted to void now can't be voided, but at least they don't think it was voided when it wasn't.
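The mark-then-check handshake can be sketched roughly like this. This is my own reconstruction, not Braintree's code — the class, state names, and methods are all hypothetical — but it shows how marking a transaction as about-to-settle makes the void/settle ordering safe in both directions:

```ruby
# Hypothetical sketch of the settle/void handshake. All state lives on the
# transaction, and we never transition between settling and voided.
class Transaction
  attr_reader :state

  def initialize
    @state = :authorized
  end

  # Called from a web request. Refused once settlement has been marked.
  def void!
    return false if %i[marking_settled settling settled].include?(@state)
    @state = :voided
    true
  end

  # The settler's first pass: "you're about to become settled."
  def mark_settling!
    @state = :marking_settled unless @state == :voided
  end

  # The settler's second pass, after waiting out the request timeout:
  # "should I still settle you?"
  def settle_if_still_eligible!
    @state = :settling if @state == :marking_settled
    @state == :settling
  end
end

# Order 1: the void arrives before settlement is marked -- the void wins.
t1 = Transaction.new
t1.void!                       # customer voids in time
t1.mark_settling!              # no-op: the transaction is already voided
t1.settle_if_still_eligible!   # => false, so the settler moves on

# Order 2: settlement is marked first -- the late void is refused.
t2 = Transaction.new
t2.mark_settling!
t2.void!                       # => false: "find another remedy"
t2.settle_if_still_eligible!   # => true, so the settlement proceeds
```

In the real system, the settler waits out the 60-second web-request timeout between the two passes; that wait is what guarantees any in-flight void has either landed or been refused before the second pass runs.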
So now we've gone through two fairly different stories of race conditions. The impact they had, the type of defect they caused, and the remedy were all very different. Remember that the secret sauce of a race condition is multiple processes interacting with the same shared data, expecting things to happen in a certain order, and then having them happen in a different order. The first remedy, the one we used in Murphy's Law, is separating the data: you still have multiple processes, but they use different data. That makes the problem go away entirely, because now we don't have shared data anymore. The remedy we used in Out of the Void is rigging the race: basically, we said we want things to happen in a certain order, and we found a way to actually guarantee that they will. There are a couple of other ways to remedy race conditions, and in order to take a deep dive into those — well, we have two more stories.

Story number three: Ghost in the Machine. Yes, that is not a typo — although, considering the handwritten nature of these slides, there are no typos anyway. I mentioned that in a previous life I worked on an Internet of Things platform. This is kind of how that works: you have a thing, and you have the internet, and the thing sends data to the internet for the internet to process. In this case, we're sending two types of messages: sensor data — here, a temperature reading — and a heartbeat, which says, "I'm online; you can continue marking me as online." Here's how the internet thinks about the thing: we keep track of the sensor data, the status, and the last time the device was updated. The timestamp is important because the internet is a terrible place, and you might end up receiving data that's stale. If we get stale data — like in this example, where the temperature reading that came in is actually older than the data we already had — then we don't want to update: "Sure, you were 65 degrees at that point, but the most recent reading I have for you is 60."

All right, here's the bug. For some devices, we noticed that their sensor data was just never getting updated. As soon as you turned the device on, you'd get one reading that got reflected in the server, but after that, nothing — we just saw that value freeze. So what happened? It comes from a subtle aspect of the way that particular device was sending heartbeat and sensor data. Specifically, it was sending them as separate messages, and this one device had a sensor that was really delayed. So heartbeats would get sent at the right time, but the sensor data messages were so delayed that they were pretty much always already stale by the time the server received them. In this case, even though that 81-degree reading was up to date at time 30 seconds, the heartbeat had come in and updated the timestamp to 40 seconds, so the sensor data got dropped as stale.

Okay, this one's boring, because the fundamental problem is the server's data model: it was just not built to handle heartbeats and sensor data coming in separately, and it certainly wasn't built to handle messages reliably coming in the wrong order. So the right remedy here — remember, we've talked about some remedies already — is to restructure how the server thinks about its data model, to change things so that it actually handles those messages in a way that makes sense for the devices. I'm going to call that remedy "handle it right," with all the hand-waving that implies, because that's hard, and it differs wildly from project to project.

There is another solution, though, and it's the one we ended up going with. We noticed that this problem was only impacting our test devices: we had one device with a very slow sensor, and the way our test environment was set up caused that to impact everything. Nobody else had this problem, which meant that it never happened in production. So the other solution, the one we ended up doing, is: won't fix. This is a perfectly reasonable approach, by the way. Restructuring your entire data model incurs a lot of risk, and it incurs a lot of engineering time, which could be spent doing other things. When the bug is not actually impacting anything in production, not fixing it is sometimes the right call.
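As an aside, the "handle it right" remedy from the previous story might look roughly like this — a sketch of my own, with hypothetical field names, in which each message type carries its own staleness timestamp so a timely heartbeat can never shadow delayed sensor data:

```ruby
# Hypothetical sketch: track staleness per message type, so a heartbeat
# updating its own timestamp can't cause sensor data to be dropped.
class DeviceRecord
  attr_reader :temperature, :status

  def initialize
    @temperature = nil
    @status = :offline
    @last_seen = { sensor: 0, heartbeat: 0 } # per-type timestamps
  end

  # Apply a message only if it's newer than the last one of ITS OWN type.
  def apply(type:, at:, value: nil)
    return false if at <= @last_seen[type]
    @last_seen[type] = at
    case type
    when :sensor    then @temperature = value
    when :heartbeat then @status = :online
    end
    true
  end
end

device = DeviceRecord.new
device.apply(type: :heartbeat, at: 40)          # heartbeat arrives on time
device.apply(type: :sensor, at: 30, value: 81)  # delayed reading still lands
device.temperature  # => 81 -- not dropped as stale
```

With the single shared timestamp from the story, the heartbeat at time 40 would have made the time-30 sensor reading look stale; with per-type timestamps, the out-of-order arrival is harmless.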
Story number four: Screaming Zombies. This one's at Braintree, too. We had some Ruby code that sent data over to a Java process, which then sent that data, formatted in a certain way, to one of our partners. This turned out to be very challenging. Sending data to a partner is always fraught with difficulty: the partner might have an outage, or we might experience some random latency on our own. And in this particular task, the data didn't just need to be sent to the partner — it needed to be sent pretty much one hundred percent reliably, and very quickly. It was basically useless if the data was sent with a delay of more than about five minutes. So, for all of this to work, we needed monitoring — monitoring of the sort: what is the latency of this request? How many requests are successful? How many times have we tried to interact with this partner? Things like that.

The way we had our monitoring set up is that we used a metrics provider, and we sent data to that provider using a thing called StatsD. The way it works is that you write the value of some variable to StatsD — you can do that whenever you want — and then StatsD will, at some regular interval, usually every ten seconds or so, write the data to the metrics provider. The thing happening here that I really want to jump up and down on is that the value foo = 6 never gets sent to the metrics provider. That's because, before StatsD bundles it up to send to the metrics provider, we've already written the value foo = 7 and overwritten the 6.

The way we did this in our code was to instantiate an object of type Reporter. Here we have a nice, happy reporter: when it's instantiated, we call start on it, and then we write our data to it whenever we feel like it, and the reporter, at its own cadence — which doesn't have to match StatsD's cadence — writes foo = 5 to StatsD. The code that accomplishes this is devilishly simple. To work within the framework that's running our code, we define a method called start and another method called stop. start sets all this up and has the metrics reporter start sending its stuff, and stop doesn't do anything, because everything gets garbage collected when we get stopped anyway.

All right, here's the bug. We noticed that our data was disappearing: we would see actual, useful values in our metrics up until some point, and then they would drop to zero. This was weird, especially because we couldn't figure out what could cause it. We also noticed that, in environments where this was running on multiple machines, it would drop to zero on one machine at a time — so the graphs wouldn't always show us that something had dropped to zero; if we weren't looking at things by machine, sometimes a value would just fall a little.

So, what's happening? Remember when I said that everything would get garbage collected? Yeah, that wasn't true. Here's an example of what actually happened when stop and start were called on our code. stop was called, and didn't do anything. start was called, which created a new reporter and started it. So then we'd start writing data to the new reporter — but the old reporter is still there. If you look at the top, it starts out as a nice, happy reporter; as soon as we forget about it, it becomes a zombie process, a zombie thread. So we have the real reporter saying foo = 6, and a zombie reporter saying foo = 5. That seems bad enough — but stop and start weren't just called every so often; the framework we were working with would call them all the time. Which means we didn't just have one zombie reporter reporting data that was a little stale; we had hundreds of them running at the same time, each writing foo = 0. And since StatsD only bundles that data up every so often, most of the time StatsD would be reading 0 and sending that to the metrics provider. So we ended up with the real data that the real reporter was trying to report getting utterly drowned out by zombies screaming empty nothingness — not into the void, but into our otherwise very useful graphs.

This turned out to be one of the hardest of these problems to debug, but it was the easiest to fix, because the way to deal with zombies is to get rid of them. Remember when I said stop didn't do anything? Well, the fix was this extremely complicated one-line change: make the metrics reporter stop reporting its metrics.
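The zombie-reporter shape, and the one-line fix, can be sketched like this. This is a toy of my own, not the actual code — the real reporter wrote to StatsD on a timer rather than to a queue — but the stop method shows the kind of change that fixed it:

```ruby
# Toy sketch of the zombie-reporter bug: each Reporter runs a background
# thread that keeps writing its current value until it is told to stop.
class Reporter
  def initialize(sink)
    @sink = sink
    @value = 0
    @running = true
  end

  def start
    @thread = Thread.new do
      while @running
        @sink << @value   # a forgotten, never-stopped reporter writes 0 forever
        sleep 0.01
      end
    end
    self
  end

  def write(value)
    @value = value
  end

  # The "extremely complicated one-line change": actually stop the thread.
  def stop
    @running = false
    @thread&.join
  end
end

sink = Queue.new
old = Reporter.new(sink).start
old.stop                  # without this call, `old` would become a zombie
live = Reporter.new(sink).start
live.write(6)             # only the live reporter's data reaches the sink
sleep 0.05
live.stop
```

Without the stop call, every start would leave another thread behind, each screaming its stale value into the shared sink alongside the real one.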
Remember how we had a number of different remedies for race conditions? We had separate the data, and rig the race — various ways of saying, "If you have multiple processes interacting with a shared source of data, don't do that," or ways of allowing them to count on things happening in a certain order. Well, the one other thing we can do is get rid of processes. When you don't have more than one process interacting with the data, you don't need to worry about race conditions anymore — and the fact that the extra processes were created by accident doesn't change that.

I hope that you've enjoyed following along and going on this journey with me through some of the race condition woes that I've experienced, and that I've heard about from my colleagues. I wanted to share these stories with you partly because they're fun — I learned a lot and was very entertained by experiencing, and learning about, the various things that can go wrong here — and also because race conditions can happen anywhere, especially in the distributed world we're living in now, where most apps are deployed with redundancy, with your code running on this machine and on that machine, sharing a database, things like that. These are things that need to be top of mind basically all the time in order to make sure that your systems continue to work robustly.

And finally, I posed a very important question at the beginning of this talk: what is the sound of a zombie screaming? And the answer...