About the talk
RailsConf 2019 - Fixing Flaky Tests Like a Detective by Sonja Peterson
Every test suite has them: a few tests that usually pass but sometimes mysteriously fail when run on the same code. Since they can’t be reliably replicated, they can be tough to fix. The good news is there’s a set of usual suspects that cause them: test order, async code, time, sorting and randomness. While walking through examples of each type, I’ll show you methods for identifying a culprit that range from capturing screenshots to traveling through time. You’ll leave with the skills to fix any flaky test fast, and with strategies for monitoring and improving your test suite's reliability overall.
All right. Just to introduce myself: I'm Sonja, and I really appreciate you all coming to my talk and RailsConf for having me. Today I'm going to be talking about fixing flaky tests, and also about how reading a lot of mystery novels helped me learn how to do that better. I want to start out by telling you a story about the first flaky test that I ever had to deal with. It was back in my first year as a software engineer, and I worked really hard building out this form that was very complicated for me. It was my first big front-end feature, and so I wrote a lot of
unit and feature tests to make sure that I didn't miss any edge cases. Everything was working pretty well, and we shipped it, but then a few days later we started to have an issue. A test failed unexpectedly on our master branch. The failing test was one of the feature tests for my form, but nothing related to the form had changed, and it went back to passing in the next build. The first time it came up, we all kind of ignored it: tests fail randomly once in a while, and that's okay, right? Then it happened again, and again, and so I said, fine.
Okay. No problem. I will spend an afternoon digging into it, I'll fix it, and we'll move on. The only problem was I had never fixed a flaky test before, and I had no idea why a test would pass or fail on different runs. So I did what I often did when trying to debug problems that I didn't really understand: I started out by trying to use trial and error. I made a random change and then ran the test over and over again to see if it would still fail occasionally. That kind of trial-and-error approach can work sometimes with normal bugs; sometimes you even start using trial
and error and that leads you to a solution that helps you better understand the actual problem. But that didn't work at all with this flaky test. Trying a random fix and running it 50 times didn't actually prove to me that I had fixed it, and a few days later, even with that fix, it failed again, so I needed another approach. And that's exactly what makes fixing flaky tests so challenging: you really can't just try random fixes and test them by running the test over and over again. It's a very slow feedback loop. We eventually figured out a fix for that flaky test, but not until
several different people had tried random fixes that failed and it had sucked up entire days of work. The other thing I learned from this was that even just a few flaky tests can really slow down your team. When a test fails without actually signaling something wrong with the test suite, you not only have to rerun all of your tests before you're ready to deploy your code, which slows down the whole development process; you also lose a little bit of trust in your test suite, and eventually you might even start ignoring real failures because you assume they're just flakes. So it's super important to learn
how to fix flaky tests efficiently and, better yet, avoid writing them in the first place. For me, the real breakthrough in figuring out how to fix flaky tests was when I came up with a method. Instead of trying things randomly, I started by gathering all the information I could about the flaky test and the times that it failed. Then I used that information to try to fit it into one of the five main categories of flaky tests (we'll talk about what those are in a minute), and based on that, I came up with a theory of what might be happening. Then, based on that theory, I would implement my fix.
At the same time that I was figuring this out, I was on kind of a mystery novel binge, and it struck me that every time I was working on fixing a flaky test, I felt kind of like a detective solving a mystery. After all, the steps to do that, at least in the novels I read, which are probably very different from real life, are basically: start with gathering evidence, then identify suspects, come up with a theory of means and motive, and then solve it. Thinking about fixing flaky tests that way made it much more enjoyable, and it actually became kind of a fun
challenge for me instead of just a frustrating and tedious problem that I had to deal with. So that's the framework I'm going to use in this talk for explaining how to fix flaky tests. Let's start with step one: gathering evidence. There are lots of pieces of information that can be helpful to have when you're trying to diagnose and fix a flaky test. Some of those include error messages and output for every time that you've seen it fail; the time of day those failures occurred; how often the test is failing (is it failing every other time, or just once in a blue moon?); and which tests were
run before the test when it failed, and in what order. So how can you efficiently get all of this information? A method that I've used in the past, and that has worked well, is this: any time a test fails on your master branch, or whatever branch you would not expect failures on because tests have had to pass before merging into it, have the failure automatically sent to a bug tracker with all the metadata you need, such as a link to the CI build where it failed. I've had success doing this with Rollbar in the past, but I'm sure other bug trackers would work for this as well.
When doing that, it's important to make sure that failures for the same test can generally be grouped together in the bug tracker. It might take a little bit of configuration or finessing to get this to work, but it's really helpful because then you're able to cross-reference between different occurrences of the same failure and figure out what they have in common, which can help you understand why they're happening. All right, so now that we have our evidence, we can start looking for suspects. With flaky tests, the nice thing is that there's basically always the same
set of usual suspects to start with, and then you can narrow down from there. Those suspects are async code, order dependency, time, unordered collections, and randomness. I'm going to go through each of these one by one. I'm going to talk through an example or two, how you might identify that a test fits into that particular category, and then how you would go about fixing it based on that. So let's start with async code, which in my experience is often one of the biggest categories of flaky tests when testing Rails apps. When I say async code, I'm talking about tests
in which some code runs asynchronously, which means that the events in the test can happen in more than one order. The most common way this comes up when you're testing Rails apps is in your system or feature tests. Most Rails apps use Capybara, either through Rails' built-in system tests or RSpec's feature tests, to write end-to-end tests for the application that spin up a Rails server and a browser, and then the test interacts with the app similarly to the way an actual user would. And the reason you're necessarily dealing with async code and concurrency when you write Capybara tests is
that there are at least three different threads involved. There's the main thread executing your test code, there's another thread that Capybara spins off to run your Rails server, and then there's the process that's running the browser, which Capybara controls via a driver. To make this a little more concrete, let's talk about a simple example. Imagine you have a Capybara test that clicks a Submit Post button in a blog post form and then checks that that post was created in the database. Here's what the happy path for the test looks like in terms of the order of the events that occur within it. First, in our test code, we
tell Capybara we want to click on that button. In the browser, that triggers a click, which then makes an Ajax request to the Rails server, which creates a blog post in the database. When that request returns, it updates the UI, and then your test code checks the database and sees that the post is there. Everything works great. So the order of events in the browser and server timelines here is pretty predictable, provided you're not optimistically updating the UI before the request that created the blog post returns. And that's one reason why you should avoid optimistic updates if you can,
because I think they create both flaky tests and flaky user experiences. But the events in the test code timeline on the top here are less predictable in terms of where they happen in relation to the other ones. One problematic ordering would be if, right after we click on Submit Post, the test code moves right along to check the database and happens to get there before the browser and the test Rails server have finished going through the process that creates that blog post. So then we'll check the database, we won't see anything there, and the test will fail. The fix here is relatively
simple. We just need to make sure that we wait until the request has finished before we try to check for anything in the database, and we can do this by adding one of Capybara's waiting finders, like have_content, which will look for something on the page and retry until it shows up, up to a certain timeout. So basically, it will check the page to see if the expected content is there; if it's not there, it'll wait briefly and then check again until it is there, and only then will we be able to move on to the next line of code, where we check for the post in the database. So with that code implemented, this
is what the timeline looks like: have_content will block us from moving forward until the rest of the process has finished. So that's a relatively simple async flake, and probably something that you've dealt with if you've written some Capybara tests, but they can get a lot more complicated and sneaky. So let's look at another example. Here we have a test which goes to a page with a list of books, clicks on a sort button, waits for the books to show up in that sorted order using one of Capybara's waiting finders, then clicks again to reverse that order and waits for the new order to show up again.
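As a rough illustration of the waiting behavior described above, here is a plain-Ruby sketch of a "waiting finder": it retries a check until it passes or a timeout elapses. This is not Capybara's implementation; wait_until and FakeApp are hypothetical names made up for this sketch, and the background thread simply stands in for the browser and server doing their work after a click.

```ruby
# Plain-Ruby sketch of a "waiting finder": retry a check until it passes
# or a timeout elapses. Hypothetical helper, not Capybara's actual code.
def wait_until(timeout: 2.0, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Hypothetical stand-in for the app: the "server" saves the post a little
# while after the click, on another thread.
class FakeApp
  def initialize
    @posts = []
    @lock = Mutex.new
  end

  # Returns the worker thread so the caller can join it when done.
  def click_submit(title)
    Thread.new do
      sleep 0.05 # simulated request latency
      @lock.synchronize { @posts << title }
    end
  end

  def post_saved?(title)
    @lock.synchronize { @posts.include?(title) }
  end
end

app = FakeApp.new
worker = app.click_submit("Hello")

# Checking right away races the worker thread and will usually see nothing.
checked_too_early = app.post_saved?("Hello")

# Waiting for a specific condition blocks until the work has finished.
saved_after_wait = wait_until { app.post_saved?("Hello") }
worker.join

puts "saved when checked immediately: #{checked_too_early}"
puts "saved after waiting: #{saved_after_wait}"
```

Note that wait_until returns on the first check if the condition is already true. A waiting check only protects you when the thing it waits for cannot already be satisfied before the action completes.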
So provided the expect alphabetical order and expect reverse alphabetical order helpers are both using the same waiting finders I was talking about, which will retry until things show up in the right place, it seems like this should work. We're waiting in between each of the things that we do. But it is possible for this to be flaky. The way that can happen is if, when we visit the books page, the books happen to already be sorted. Then when we click on sort and expect the alphabetical order, that expect alphabetical order line is no longer actually waiting or blocking anything for
us. It passes immediately, and we move on to the next click. So both of those clicks can actually happen before the page has reloaded the first time with the books in alphabetical order. It just kind of acts like a double click, and as a result we can end up with the test never getting to the state of being in reverse alphabetical order. The fix is actually fairly similar to the last one. We just need to add some more specific waiting finders to make sure that we don't move on in our test code too quickly. So in this case, we might look for something on the page that indicates the first sort is
actually finished, beyond the fact that the books are in order. Then we can safely move on to the next click. So if you're looking at a given flaky test and you're trying to figure out whether it might belong to this async code category, the first question I usually ask is: is it a system or feature test, something that uses Capybara or some other way of interacting with the browser? That's the number one place for this, though it's possible that you have async code in other areas too. Then look for places where the test does something without explicitly waiting for the results,
even in a place where it looks relatively innocent. It's always a good idea to make sure that you're behaving like a real user would and waiting in between each thing you do to see the result. When you're trying to identify whether a flake is due to async code, it can also be helpful to use Capybara's ability to save screenshots, which you can do by calling save_screenshot directly, provided you're using one of the drivers that supports it, or with the capybara-screenshot gem, which wraps your tests so that every time they fail you'll capture a screenshot of the end state of the
test. When you're looking to prevent async flakes, there are a few things to keep in mind. First, as I mentioned, make sure your test is waiting for each action within it to finish, and when you're doing this, make sure you're not using sleep or waiting for some arbitrary amount of time. It's important to wait for something specific. That's because if you wait for an arbitrary amount of time, at some point your test will just happen to be running slowly enough that the arbitrary amount of time isn't long enough, and it will flake again. It also means that you might be waiting
longer than you need to in a lot of other cases because the process happens faster, so by waiting for something specific you can avoid both of those pitfalls. It's also important to understand Capybara's API: which methods wait and which don't. Everything based on find will definitely wait, but there are certain things, like all, that don't wait in the same way. It's important to be familiar with Capybara's docs and how to use its tools correctly. Finally, it's important to check that each assertion you're making in the test is working as you expect it to. It's
very easy to write assertions that look like they're doing the correct waiting behavior but actually don't. As we saw in that double-click example, sometimes content is already on the page in a different place, and that allows a kind of accidental success. All right, so let's move on to our next suspect: order dependency. I define this category as any tests that can pass or fail based on which tests ran before them. Usually this is caused by some sort of state leaking between tests. So depending on whether the state another test creates is
present or not present, it can cause a flaky test to fail. There are a few potential areas where shared state can happen in your tests. One is the database. Another is global or class variables, if those are modified within your tests. And then there's also the browser. Typically one of the biggest issues with Rails apps is database state, so let's talk about that a little more in depth. When you're writing tests, each test should start with a clean database. That might not mean a fully empty database, but if anything is created, updated, or deleted in the database
during a single test, it should be put back the way it was at the beginning. I kind of think of it like "leave no trace" when you're camping. This is important because otherwise those changes to the database could have unexpected impacts on later tests, or create dependencies between tests such that you can't remove or reorder tests without risking failures. There are several different ways to handle clearing your database state. Wrapping your test in a transaction and rolling it back after the test is generally the fastest way to clear the database, and it's the default for testing in Rails. But in the
past you couldn't use transactions with Capybara, because the test code and the test server didn't share a database connection. They were running in separate transactions and couldn't see the data in each other's transactions. Rails 5 system tests addressed this by allowing shared access to the database connection in tests, so they can look at data within the same transaction. However, running in transactions can still have some subtle differences from the normal behavior of your app, and so there may be reasons why you still don't want to use them as your cleanup method. For example,
if you have any after_commit hooks set up on your models that only run when a transaction commits, those probably won't run when you're using transactional cleanup. If you're not using transactional cleanup, another option is the database_cleaner gem, which can clean by either truncating tables or running a DELETE FROM statement on them. This is generally slower than transactions, but it is a little bit more realistic in that you don't have an additional transaction wrapped around everything that's happening in your test. The important thing to make sure of, if you're using this
method, is that this database cleanup runs after Capybara's cleanup. Capybara does some work to make sure that the browser state is cleared and settled between tests, including waiting for any Ajax requests to resolve, and if you clean your database before that cleanup and waiting happens, those requests could create some data that doesn't get cleaned up. So there's a bit of an ordering issue here, and you can avoid it, if you're using RSpec, by putting your database cleaner call in an append_after block. So why do I tell you all of this? The thing about
database cleaning is that it should just work, and it often does, especially if you're just using Rails' basic built-in transactional cleaning. But there are a lot of different ways that you could have your Rails app and test suite configured, and it is possible to do it in such a way that certain gotchas are introduced. So it's important to know how your database cleaning works, when it runs, and whether there's anything it's leaving behind, especially if you're starting to deal with flaky tests that seem to be order dependent. Let's look at an example of this. Let's say we're using database cleaner with a
truncation strategy. Maybe we started doing that back before Rails 5 let us share a database connection, or maybe we don't want any weirdness around transactions; it could be for one of those reasons. We notice the test suite is slow, so somebody comes in to optimize it a little bit, and they notice that we're creating book genres in almost all the tests. They decide to create the book genres once, before the entire test suite runs, and then exclude them from the database cleaner. This will speed up our test suite, but it does introduce a gap in our cleaning. If we make any kind of
modification to book genres in a test, then since we're using truncation to clean the database instead of transactions, that update won't be undone between tests, and this could potentially affect later tests and show up as an order-dependent flake. To be clear, I'm not picking on database cleaner here. I just want to give an example of how a minor configuration change could allow you to create more flakes, and why it's important to have a good understanding of how cleaning is actually working in your test suite and the trade-offs you might introduce depending on how you do it. As I mentioned at the beginning, there
are some other possible sources of order dependency via shared state. One is the browser: since tests run within the same browser, it can contain specific state depending on which test just ran. Capybara works pretty hard to clean all of this up before it moves on to the next test, so that should usually be taken care of for you, but it is possible, again depending on your configuration and how you have everything set up, that something sneaks through, so it's good to be aware of that as a possible place where shared state could be. Another is global and class variables,
as I mentioned. If you modify those, they can persist from one test to the next. Normally Ruby will yell at you if you reassign a global constant, but one area where these can kind of sneak in is if you have a hash assigned to a global constant and you just change one of the values within it; since that isn't reassigning the entire thing, it won't come up with a warning. All right. So if you're looking at a particular test and you're trying to figure out whether it's being caused by order dependency, there are a couple of different strategies you can use. One is just to start out by
trying to replicate the failure with the same set of tests in the same order. If you can take a look at how it ran in your CI, or wherever you saw it fail, and run the exact same set of tests together with the same seed value to put them in the same order, and it fails every time you do that, then you have a sense that this is probably an order-dependent test. But at that point you still don't know which tests are affecting each other. To figure that out, you're probably going to want to cross-reference each time you've seen it fail and see if the same tests were running before that failure.
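The cross-referencing step can be done by hand, but it can also be sketched in a few lines of Ruby. The example below is only an illustration: the test names and the failing_runs data are made up, and common_predecessors is a hypothetical helper, not part of RSpec or Minitest. It assumes you have copied the run order of each failing build out of your CI logs.

```ruby
require "set"

# Each array is the order in which tests ran in one failing CI build
# (hypothetical names, as you might copy them from the build logs).
failing_runs = [
  %w[users_spec genres_spec posts_spec books_spec],
  %w[genres_spec users_spec books_spec posts_spec],
  %w[posts_spec genres_spec books_spec users_spec]
]

# Returns the tests that ran before `flaky` in every one of the given runs.
def common_predecessors(runs, flaky)
  runs
    .map { |order| order.take_while { |name| name != flaky }.to_set }
    .reduce(:&)   # set intersection across all failing runs
    .to_a
    .sort
end

suspects = common_predecessors(failing_runs, "books_spec")
puts suspects.inspect
```

Any test that shows up in the result ran before the flaky test in every recorded failure, which makes it a reasonable first suspect for leaking state. The more failures you have recorded, the shorter this list tends to get.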
RSpec has a built-in bisect tool that you can also use to help narrow down the set of tests to the ones that produce the dependency. However, you may find that it runs a bit slowly, depending on how fast your test suite is, so sometimes it's easier to just look at things manually. In order to prevent order dependency, you should make sure that you configure your test suite to run in random order. This might seem kind of counterintuitive, but the goal is to surface order-dependent tests quickly, not just when you add, remove, or move around a certain test. Running in
random order is the default in Minitest and is configurable in RSpec. Also, make sure you spend some time understanding your entire test setup and teardown process, and work to close any gaps where shared state might be leaking through from one test to another. All right, we're on to our next suspect: time. This is probably the one that gives me the most headaches. This category includes any test that can pass or fail depending on the time of day that it is run. Let's start with an example. Here we have this code that runs in a before_save hook on our task model. It sets an automatic due date of
the next day at the end of the day if a due date isn't already specified. Then we write this test: we create a task with no due date specified, and we check that its due date is what we expect it to be, the current date plus one at the end of the day. It seems like it should be fine. But this test actually starts failing after 7 p.m. every night, very strangely. How could that possibly be happening? The trouble is that we're using two slightly different ways of calculating tomorrow here. Date.tomorrow uses the time based on the time zone we set for our Rails app, while Date.
today plus one will be based on the system time. So if the system time is in UTC and our Rails app's time zone is EST, they'll be 5 hours apart, and after 7 p.m. they'll be on different days, which results in this failure. So how can we fix this? One easy fix would be to just use Date.current, which respects the time zone, instead of Date.today. Another option would be to use the Timecop gem, which basically allows you to freeze time by mocking out Ruby's sense of what time it is. With Timecop we can freeze time here, so that it's January 1st at 10 a.m., and then our expected due date can just be a static
value, January 2nd at 11:59 p.m., and we can check that the due date is that exact value. It's just kind of helpful for making your tests a little more explicit and simpler, so that they don't contain complicated logic that itself needs to be tested. When you're trying to determine whether a given flaky test is time-based, the first obvious thing to do is to look for any references to date or time in the code under test. If you have a record of past failures, you can also check whether they've all happened around the same time of day. And finally, if you suspect it's time-based, you
can add Timecop to that spec, just temporarily, to set it to the time of day where you've seen it fail before and see if it fails every time when you do that. As we saw in that example, using Timecop to freeze time can make it easier to write reliable tests that deal with time, and also easier to understand exactly what you're testing. Another strategy that you can use to surface time-based flakes is to set up your test suite so that it wraps every test in Timecop.travel, mocking the time to a different random time of day on each run of the suite and printing it
out before the tests run. This might seem a little crazy, but it's actually very helpful for surfacing tests that would normally only fail after business hours, when nobody happens to be running the test suite, so that you see them during the normal business day instead of at midnight when you just got woken up on call and you're trying to desperately ship a deploy and the test suite keeps failing unexpectedly. It's just important to make sure that you're printing out the time of day that each test is running at, and that you're able to rerun the test with that same time of day
later, so that if you're debugging a failure you can easily replicate it. All right, our next suspect is unordered collections. This is a relatively simpler one. It's just any test that can pass or fail depending on the order of a set of items within it that doesn't have a pre-specified order. So let's look at an example. Here we have a test where we're looking at a set of active posts, and we expect them to equal some specific posts that perhaps we've created earlier in the test. The issue with this test is that the database query in the first line doesn't have a
specific order. So even though things will often be returned from the database in the same order just by chance, there's no guarantee that this will always happen, and when it doesn't, this test will fail. The fix is just to make sure that we're specifying an order on the items returned by the database and that our expected posts are in that exact same order. When trying to identify whether a flaky test is being caused by unordered collections, look for any assertions about the order of an array, the contents of an array, or the first or last item
in one. If you're using RSpec, you can use the match_array expectation, which allows you to assert things about what's in an array without caring about the order, or you can just add an explicit sort to both the expectation and what you're looking at. All right, so we've gotten to our last possible suspect, which is randomness. You might think that all of these different categories of flaky tests have something to do with randomness, since they're randomly failing. But in this case I'm talking about tests that actually explicitly invoke randomness via a
random number generator. Here's an example of a test data factory that uses Factory Bot to create an event. We might start out with just having a start date and then add an end date after that at some point, and we decide, okay, the start date will be sometime up to 5 days from now and the end date will be sometime up to 10 days from now. We could run into an issue where the end date actually ends up being earlier than the start date, since they're both random values. So if we add a validation to events that enforces that the start date comes before the end date, then some percentage
of the time our tests that deal with events will fail, because they'll have invalid data. In this case, we're just better off being explicit and creating the same dates every time. That might feel a little counterintuitive, because randomness can seem useful as a tool for testing a large spectrum of different types of data and so on, but there's a big downside in not being able to know what your tests are actually testing, and in having them be flaky. A better strategy is to actually write a test for each of the specific cases that you would like to test.
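To make the failure mode concrete, here is a small self-contained Ruby sketch. It does not use Factory Bot or Rails; Event, random_event, and explicit_event are hypothetical stand-ins, the fixed date is arbitrary, and the ranges (0 to 5 days for the start, 0 to 10 days for the end) are an assumption about what the random factory described above might look like.

```ruby
require "date"

# Minimal stand-in for a model with a "start must not be after end" validation.
Event = Struct.new(:start_date, :end_date) do
  def valid?
    end_date >= start_date
  end
end

# Flaky factory: both dates are random, so end_date can land before start_date.
def random_event(today, rng)
  Event.new(today + rng.rand(0..5), today + rng.rand(0..10))
end

# Deterministic factory: fixed offsets, so every event it builds is valid.
def explicit_event(today)
  Event.new(today + 5, today + 10)
end

today = Date.new(2019, 5, 1) # arbitrary fixed date for the sketch
rng = Random.new(42)         # seeded so the sketch behaves the same on every run

invalid_count = 1000.times.count { !random_event(today, rng).valid? }
puts "invalid random events: #{invalid_count} of 1000"
puts "explicit event valid: #{explicit_event(today).valid?}"
```

With these ranges, a noticeable fraction of the randomly generated events are invalid, which is exactly the kind of occasional failure that shows up as a flake. If the edge case where the end date precedes the start date matters, it is clearer to write one explicit test for it than to hope random data hits it.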
So if you're trying to identify whether randomness is causing your flake, the first obvious thing to do is to look for a random number generator; often it will come up in your factories or fixtures. Another thing you can try is using the --seed option in either Minitest or RSpec, which will allow you to run the tests with the same seed value for randomness and generally get the same random values produced. With RSpec, you just want to make sure that you actually have Kernel.srand set to RSpec's config.seed, so that
passing the seed option will actually control the random values. To prevent randomness-based flakes, as I mentioned, the general strategy is to remove randomness from your tests and instead explicitly test the boundaries and edge cases that you're interested in. It's also generally a good idea to avoid using gems like Faker to generate data for tests. They're useful for generating realistic-seeming data in your dev environment, but in your tests, at least from my perspective, it's more important to have reliable behavior than random and realistic
data. All right, so now we've looked at all of the usual suspects, so we can move on to forming a theory and actually solving a flaky test mystery. My first strategy tip, when you're trying to find a fix for a flaky test and there isn't an obvious one popping out at you, is just to run through each of those categories that I've described and look for any connection, however tenuous, that could link this test to one of them. Even if it looks perfectly fine but it is dealing with dates, maybe dig down that particular path. And again, resist the urge to use trial and error to
test fixes. It's more important to form a strong theory about how this might be happening first. Even if you're not 100% sure, it's going to work a lot better than trial and error. What you can do, and what might involve a little bit of a different kind of trial and error, is try to find a way to reliably replicate the failure to prove your theory. This came up a little bit when I was talking about randomness, dates, and order dependency, because for those you have more control over the factors that might be producing the flake. You can freeze time, you can run the tests in the same order, you
can use the same random seed, and then potentially be able to replicate the failure. Since most flaky tests are flaking very infrequently and passing most of the time, if you're able to get one to fail two or three times in a row, you can be pretty confident that you've replicated it. In the other direction, when you're using trial and error to test a fix and you're seeing it pass, it takes a lot of runs to be confident that what you're seeing is actually the fix working. So you might try those methods and still be stuck. Flaky tests are hard. One strategy you
can try in that situation is adding some code that will give you more information the next time it fails. If you've got a hunch that something's off, say with what's in the database, or you're curious about the value of a certain variable, add that to something that will be logged out in the test. Then the next time it fails in CI, you can take a look at that and factor it into your process of fixing it. Another strategy that I really like is pairing with another developer. Fixing flaky tests is so much about having a deep
understanding of your testing tools, your framework, and your own code. Everybody is going to have some gaps; when you have two people working together, you can fill in each other's gaps a little bit, and you can also help keep each other from going down rabbit holes or getting too frustrated chasing down the same wrong idea. Another question I see coming up a lot at this point is: can I just delete it? I can't fix it. It keeps failing. Is it even worth it anymore? Why did I become a developer? That kind of thing. My first response to this is that you have to
accept it. If you're writing tests, at some point you are inevitably going to have to deal with flaky ones. You can't just delete any test that starts to be flaky, because you'll end up making significant compromises in the coverage that you have for your app. Also, learning to fix and avoid flaky tests is a skill that you can develop over time, and it's one that's really worth investing in, even if that means spending two days fixing one instead of just deleting it. That being said, when I'm dealing with a flaky test, I do like to take a step back and think about the test coverage I have
for a feature holistically. What situations do I have coverage for? Which ones am I maybe neglecting or ignoring? And what are the stakes of having the kind of bug that might slip through the cracks in my coverage? If the test I'm looking at is for a very small edge case with low stakes, or it's something that's actually well covered by other tests or could be covered by a different type of test, maybe it does make sense to delete it or replace it. This ties into a bigger-picture idea, which is that when we're writing tests, we're always making trade-offs between realism and maintainability. Using
automated tests instead of manual QA is itself a trade-off in terms of substituting in a machine to do the testing for us, which is going to behave differently than an actual user would. But it's worth it in a lot of situations, because we can get results faster and more consistently, and we can add tests as we code. Different types of tests go to different lengths to mimic real life, and generally the most realistic ones are the ones that are hardest to maintain and keep from getting flaky. There's the idea of the test pyramid, which I think Mike Cohn first came up with, but I think
there have been many other spins on it since, and this is my particular spin. You should have a strong foundation of lots of unit tests on the bottom. They're simpler, they're faster, and they're less likely to be flaky. Then as you go from less realistic tests to more realistic tests, you should have fewer of those types of tests, because they are going to take more effort to maintain. The tests themselves are coarser-grained, so they're testing a lot more and covering a lot more situations. And these more realistic tests are just in general more likely to become flaky, because there are so many more moving parts
involved. So it's wise to keep the number of them in your test suite in balance: test the major happy paths and the major problems, but leave certain edge cases and other types of testing for more specific and isolated tests. The last thing I want to talk about is how to work with the rest of your team to fix flaky tests. It shouldn't be just a solo effort. Since flaky tests can slow everyone down and erode everyone's trust in your test suite, they should be a really high priority to fix. If you can manage it, they should
actually be the next highest priority under production fires. This needs to be something that you talk about as a team, that you communicate to your new hires, and that you all agree is worth investing time in, to keep each other moving quickly and trusting your test suite. The next thing I recommend is making sure you have a specific person assigned to each active flake. That person is in charge of looking for a fix and deciding whether you need to temporarily disable the test while it's being worked on if it's flaking frequently. That person should reach out to others
for help if they're stuck, and so on. It's important to make sure that responsibility is spread out among your entire team. Don't let one person end up being the flake master while everybody else ignores them. If you're already sending flakes to a bug tracker, as I suggested in the gathering evidence section, you can use that as a place to assign them to different people. The next thing I recommend is setting a target for your master branch pass rate and tracking it week over week. So for example, you could say that you want to have builds on your master branch pass 90% of the time,
and then tracking that helps you keep an eye on whether you're progressing towards that goal and course-correct if your efforts aren't working and you need to invest more in it, or if you have wider issues with your test suite's reliability. To wrap this all up, if you remember just one thing from my talk, I hope it's that flaky tests don't have to be just an annoying and frustrating problem or something you try to ignore as much as you can. Fixing them can actually be an opportunity to gain a deeper understanding of your tools and your code, and also to pretend
you're a detective for a little while. So hopefully this talk has made it easier for you to do that. Thank you all for coming. If you have any questions, I'll be up here, and you can come up and ask me afterwards.
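The master branch pass-rate target mentioned near the end of the talk can be sketched in a few lines of Ruby. This is only a minimal illustration and was not shown in the talk: the `Build` struct, the `weekly_pass_rates` and `weeks_below_target` names, and the sample data are all hypothetical, and it assumes you can export each master build's date and result from your CI system.

```ruby
require "date"

# Hypothetical build record: the date a master build finished and whether it passed.
Build = Struct.new(:finished_on, :passed)

# Groups builds by ISO year and week, and returns
# { [year, week] => pass rate as a percentage }.
def weekly_pass_rates(builds)
  builds
    .group_by { |build| [build.finished_on.cwyear, build.finished_on.cweek] }
    .transform_values do |week_builds|
      passed = week_builds.count(&:passed)
      (100.0 * passed / week_builds.size).round(1)
    end
end

# Returns the weeks whose pass rate fell below the target percentage.
def weeks_below_target(rates, target: 90.0)
  rates.select { |_week, rate| rate < target }.keys
end

# Made-up sample data: four builds in one week, two in the next.
builds = [
  Build.new(Date.new(2019, 4, 1), true),
  Build.new(Date.new(2019, 4, 2), true),
  Build.new(Date.new(2019, 4, 3), false),
  Build.new(Date.new(2019, 4, 4), true),
  Build.new(Date.new(2019, 4, 8), true),
  Build.new(Date.new(2019, 4, 9), true),
]

rates = weekly_pass_rates(builds)
flagged = weeks_below_target(rates, target: 90.0)
```

With this sample data, the first week comes in at 75% (three of four builds passed) and is flagged against the 90% target, while the second week passes every build. Grouping by week rather than by day keeps a single bad afternoon from dominating the trend.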