About the talk
RailsConf 2019 - Death by a thousand commits by Kyle d'Oliveira
On the 1st commit, things are getting started. On the 10th commit, the feature is live and users are giving feedback. On the 100th commit, users are delighted to be using the application. But on the 1000th commit, users are unhappy with the responsiveness of the application and the developers are struggling to move at the velocity they once were. Does this sound familiar?
We will go over some of the pieces of technical debt that can accumulate and can cause application performance and development velocity issues, and the strategies Clio uses to keep these kinds of technical debt under control.
Good morning, everyone. Welcome to Death by a Thousand Commits. My name is Kyle d'Oliveira. I am based out of Vancouver, British Columbia, Canada. If you've never been to the West Coast, specifically the west coast of Canada, it is beautiful; I highly recommend it. I am a staff software developer at Clio. Clio is a twelve-year-old, fast-paced company focused on making legal software to transform the practice of law, for good. I work on their backend infrastructure team, and my team focuses on three major things. We
focus on how we can make the code bases scalable, both in terms of data set size and code size. We make the code bases approachable: how long does it take for a new developer to onboard into a specific area, or into the code base in general? And third, we focus on the overall developer experience of working in all of our code bases. This means I've been thinking a lot about technical debt. Over the years that I've been with Clio, I have seen the company go through various stages of handling it, from ignoring it entirely to
setting aside a little bit of time to fix it but still letting it accumulate, to getting to a point where we are striving to pay down all technical debt to improve all future development. I don't know how many times I've looked at a piece of code and thought, who wrote this, only to find out it was actually me from several years earlier. Technical debt is a really interesting concept. It often doesn't change the behavior of the system, so the business might not see it. What it does do is slow down development or potentially impact the performance of the site, and that
is something that the business does see. As developers, we need to advocate to the company about the risks and rewards of leaving technical debt versus paying it down and pushing forward. Now, the slowness can come from developers getting pulled off projects to address emergencies. Or maybe things are just slower overall, and they need to spend more time on maintenance and optimizations. Or possibly it's because of legacy code that developers
need to really understand and work with, which just slows them down. Back in 2009, Martin Fowler wrote a post about how technical debt is introduced into code bases, with these technical debt quadrants. On one axis we have deliberate and inadvertent actions; on the other axis we have reckless and prudent. In the reckless-inadvertent area, this is often when people don't actually know any better and there's a hard deadline or things that are pushing people to move faster. At Clio, we've had instances where we've deployed code in
our API, which we thought was the right decision at the time, and years later it is still biting us. We just didn't know, and we can't get rid of it. When you move into reckless-deliberate, this is when people know better but still push for it anyway. They cut corners and don't worry about the technical debt they are introducing. Sometimes it might be okay to be in this quadrant: when speed is absolutely crucial and first-to-market is the most important thing, or when you're working on a prototype and you just need to
understand the system. But ideally we're moving more into the prudent-deliberate side of things, where you know the consequences of the technical debt you are delivering, and you deal with it before there are any major consequences, or as the consequences come up. At Clio we try to be in this quadrant as much as possible, and we try to schedule time to clean things up immediately after projects. We take the view that if we could deliver something a month early, get some user feedback, and stabilize things, that's actually something we often want to take, and then use
the extra month to just clean up. The project isn't done yet just because we've gotten it into the hands of the users early; we're going to clean up right away. Lastly, we also have this inadvertent-prudent category, where you introduce something and didn't really realize it, but once you do realize it, you use it as a teaching lesson. I think it's unreasonable to expect developers to know all of the ways that technical debt can be introduced and which ones will slow them down. As much as we want to be in the top right corner, oftentimes we end up in the bottom right, and we'll use
these as learning lessons for the whole company and all of the other developers. Once technical debt is in the code base, though, which is inevitable, we need strategies for how to pay it down. The most basic one is just to be reactive: to deal with the consequences as they come. This could be because the developers didn't know that the technical debt existed in the code base in the first place, or it could be that they knew it was there and deemed it an acceptable risk, and that's no longer the case. So this could be that your database is now slow and servers are unable
to serve requests, and the business is pushing to have developers fix this as soon as possible. Or it could be that there's some sort of situation going on that is causing customers to leave. Either way, it has now become the highest priority for the developers and distracts them from their projects. You could look to be a little bit more proactive: identify problems and try to tackle them before they become emergencies. You could think about what will become a problem and fix it before then. You can think about what could indicate there is a problem and look to monitor
it so that you get a warning. Or, if you know there are bad patterns that you want to keep out of your code base in general, why not try to keep them out from day one rather than trying to fix them as you find them? It is often hard for the business to understand the risks associated with leaving technical debt in place, and as developers it's our responsibility to advocate for time to fix these. Lastly, we can look to invest in tools. Investing in tools might slow you down now, but in the long run it will greatly speed you up. Sometimes the tools have
already been built, so all that's required is a little bit of research and maybe setup. But sometimes they are not built and you need to build them yourself. I'd like to focus on four lessons that we've learned at Clio that have helped us mitigate some of the technical debt going forward in our code base. Some of the tactics and tools that I'll talk about today may be directly incorporable into your code bases, or hopefully you'll start thinking about how you can tool away some of the technical debt that exists in your project. The first lesson
I want to talk about is this: why would you want to fix technical debt when you can find a way to just remove it entirely? There's a whole classification of problems you can put some time into, and then in the future developers just don't need to think about them anymore. I want to talk about a situation at Clio where we noticed this. There was a point in time when there was a controller endpoint that wasn't being consistently fast. Locally it passed all of our tests, and on our staging servers it felt like it should be one of the fastest things, but in actual production it
was one of the slowest, and we wanted to know why. We started digging into what was going on, and we found out that the endpoint was making hundreds if not thousands of database queries. It turns out this endpoint had a whole bunch of problems with N+1 queries, the silent performance tax. For those of you who aren't familiar with them, N+1 queries are an issue that shows up in Rails, and in anything that uses a relational database. What ends up happening is you first query a collection of objects from the database, for
instance contacts, and then you try to access associations on those contacts, and what ends up happening is you make queries individually for each contact. Each one is fairly fast, and in small numbers it's not really noticeable, but when there are hundreds or thousands of them, it can significantly slow you down. They don't really influence end-user behavior and they don't influence the behavior of the system, so they're easily overlooked. What you should be doing is making two queries: one for the contacts and one for the collection of emails.
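To make the cost concrete, here is a minimal plain-Ruby sketch with no Rails and no real database. `FakeDatabase` and its methods are hypothetical stand-ins that only count how many queries get issued; they are assumptions for illustration and not Active Record's API. It compares the per-contact pattern described above with the two-query pattern, using 200 contacts.

```ruby
# A tiny in-memory stand-in for a database that counts every query it serves.
# FakeDatabase and its methods are hypothetical names for this sketch only.
class FakeDatabase
  attr_reader :query_count

  def initialize(contacts, emails)
    @contacts = contacts # array of { id:, name: }
    @emails = emails     # array of { contact_id:, address: }
    @query_count = 0
  end

  # Roughly: SELECT * FROM contacts LIMIT n
  def contacts(limit:)
    @query_count += 1
    @contacts.first(limit)
  end

  # Roughly: SELECT * FROM emails WHERE contact_id = ?
  def emails_for(contact_id)
    @query_count += 1
    @emails.select { |e| e[:contact_id] == contact_id }
  end

  # Roughly: SELECT * FROM emails WHERE contact_id IN (...)
  def emails_for_all(contact_ids)
    @query_count += 1
    @emails.select { |e| contact_ids.include?(e[:contact_id]) }
  end
end

def build_db
  contacts = (1..200).map { |i| { id: i, name: "Contact #{i}" } }
  emails = contacts.map { |c| { contact_id: c[:id], address: "c#{c[:id]}@example.com" } }
  FakeDatabase.new(contacts, emails)
end

# N+1: one query for the contacts, then one more query per contact.
def naive_payload(db)
  db.contacts(limit: 200).map do |c|
    { id: c[:id], name: c[:name], emails: db.emails_for(c[:id]).map { |e| e[:address] } }
  end
end

# Batched: one query for the contacts, one for all of their emails,
# then the emails are grouped by contact in memory.
def batched_payload(db)
  contacts = db.contacts(limit: 200)
  by_contact = db.emails_for_all(contacts.map { |c| c[:id] }).group_by { |e| e[:contact_id] }
  contacts.map do |c|
    { id: c[:id], name: c[:name], emails: by_contact.fetch(c[:id], []).map { |e| e[:address] } }
  end
end

naive_db = build_db
batched_db = build_db
naive_payload(naive_db)
batched_payload(batched_db)
puts "naive: #{naive_db.query_count} queries, batched: #{batched_db.query_count} queries"
```

Both functions build the same payload; the naive one issues 201 queries for it and the batched one issues 2. Each further association accessed per contact would add another 200 queries to the naive version and only one to the batched version.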
Let's look at an actual code example of how these would start popping up in your code base over time. Imagine we have a very basic JSON API. Here's a very simple controller that just grabs some contacts and renders them as JSON. I put a limit on it so we don't render all of our contacts. If you use Active Model Serializers, you can have a serializer that looks like this. We'll start with just ID and name, and now we have a functioning JSON API where we get a response back of all of our contact IDs and names.
Later, in the future, customers are saying this API would be really good if it also returned the emails of the contacts. If the model has the associations already set up, adding it to the API is really easy: you just add one line. We can get this out the door very quickly and feel like we're very productive and moving very fast. But did you notice the N+1 query that I just introduced? I'm talking about it, so hopefully it's front of mind. But if you were reviewing this piece of code in a
standalone pull request, would you have noticed it? Maybe, maybe not. It requires human effort, and humans make mistakes, and there's just a little bit of technical debt here. If this got out into production, would it be the end of the world? You'd probably not even notice it. In the future this API grows. What happens when we add phone numbers, and then addresses, and then maybe contacts have emergency contacts, which also have phone numbers, emails, and addresses? In the ideal fashion we're only making a handful of queries for all of this information. If you were paying attention, we're
doing this in the naive fashion. We could make one query for the contacts, and then 200 for the emails, and 200 for the phone numbers, and 200 for the addresses, and so on. What could be just a handful of queries in the ideal state is actually hundreds or thousands. This is basically the situation that we ended up in at Clio. Now, Rails does offer a way to fix this. We can look for whatever associations are being used and eager load them in our controller. Now that original N+1 query that I introduced is gone. There are also a couple of issues with this
approach. It requires a lot of manual human effort. Developers need to understand the usage of the code and the associations that are being used. They need to find the place where the initial query was made so that they can go and add the includes. For small systems, this is really easy. But for complex systems, where there's a lot of distance between where the data is being used and where the data is being queried, it can be really challenging. In the previous example, the usage was in the serializer but the query was actually in the controller, so developers
need to understand both. It also only fixes one instance of these N+1 queries at a time, so every time we touch these files we potentially need to be thinking about this. That's more weight for the reviewers and more work that people need to be thinking about. And what happens when we stop needing the association? Do we go back and remove the preloading, or do we just accept that we'll leave it as it is? There really has to be a better way. There are tools that can help raise a signal whenever an N+1 query is generated, but it still
requires manual effort to fix them. So we built a gem. We wanted to load the associations just in time, so it ended up being called the JIT preloader. Naming is hard. But it's a tool we've had running in our applications in production for the last couple of years, and it just removes N+1 queries entirely. We drop the gem into a project, we can configure it to be globally enabled, and here's what it looks like. Here's a graph of the database time used by the example I was talking about. We have a data point for every 30 seconds, and we were recording about 2 to 4
minutes of database time in a very spiky fashion. The blue is spiky and using a lot of database time. After we released the gem, we were using much less, stabilized around 30 seconds: about 4 to 8 times better, just from removing N+1 queries by adding those two lines to our project. Similarly, here's a graph of the database queries to a specific table in our database. The scale is unfortunately cut off, but we were making hundreds and thousands of queries here, and then we
deployed the gem and stabilized at just a handful. We even took some samples of the requests before and after we deployed the gem, and we found that the 95th percentile of requests in that sample was twice as good just after deploying the gem, and the 99th percentile was 3 times better. What could have been a huge amount of effort, going through all of our controllers and removing all of our N+1 queries, turned into an investment: we could build a tool, and now we don't need to consider N+1 queries anymore, which frees all of our
developers, and all the people who review code, to focus more on delivering value to the business. You could be thinking about what pieces of technical debt you could just automate away entirely. There could be a publicly accessible gem that does it for you. If not, maybe you can contribute one back to the community. In an ideal world, we're sharing all of this knowledge and making it so these pesky issues we complain about are just a thing of the past. The second lesson I want to talk about is to clean up code that you don't use. I'm sure many of
you have experienced a situation like this, where you do a big framework upgrade, like Rails, and you start coming across code that breaks with the new version. But you don't know if it's being used or not. There are tests around it, and the tests are failing. So do you support it or do you not? You probably support it, because the tests are failing and that's telling you something. But if you deleted it, it would save you a whole bunch of effort, if it's not being used. At Clio, we had a situation where we were doing an interface upgrade, so
in the short term we decided that we would duplicate all of our views: one set of views had the new interface, one set of views had the old. This process for customers only lasted a couple of months, but we never ended up cleaning up those duplicated views, and in development mode you could still see both. So we ended up supporting both for way too long, until someone got fed up, spent the effort, and just started deleting things en masse. Unfortunately, we didn't learn a lesson. A couple of years later we found out that, okay, we need to do a big
upgrade, and we've accumulated a huge amount of rake tasks that are just not being used anymore, and we support them. At some point we get fed up, put a little bit of effort in, and delete everything. Unfortunately, we didn't learn a lesson there either. Recently, we started switching from Rails-generated HTML templates to a front end that consumes a JSON API. As we transition off the old endpoints, we really want to clean them up as we go, but it's really easy not to. It's hard to know if the code is actually being used in production. There are test cases
that will fail if you change this code or remove it. Leaving it alone doesn't cost anything, except for later on when you need to continue supporting it as your code base evolves. You're just hurting yourself in the future. We wanted to fix this, so we had some ideas and started playing with them. The first idea we called the tombstone, which was a simple helper that made an error and sent it to Bugsnag, which is our bug tracking service. We'd just put single lines into random methods or views to say this is dead as of this date,
and then we could look for trends in the exceptions. It's a little bit of a backwards signal: if the exception is being thrown, we know the code is being used, so we can't delete it. But it was a starting place. It helped give us a little bit of confidence to remove stuff, but it was still a lot of manual effort, and we think anything that requires a lot of manual effort should be automated. So recently we started building something we called the dead code detector. It can't quite tackle the rake tasks or views from the previous examples
yet, but it can track a lot of methods. We've been using it in production for about a month now, and last week we did some analysis of the reports and started deleting things. You set it up, leave it around for a reasonable amount of time, get a list of things that haven't been used, and then you can delete them. There's a similar library out there, a gem called Coverband. It's another useful tool; if it works for your project, great. They have different trade-offs. Coverband gives you finer-grained detail and will let you dig into whether individual lines have been called or
not, whereas the dead code detector will only give you methods. But Coverband is a little bit more volatile as the code changes, because it's tracking line numbers, and it does have a larger memory footprint. Either way, you can delete code that's not being used. We added the gem to a project, and basically from about a month of work we could delete about 300 controller actions and 1,000 methods, and we're being really conservative here. We're only tracking controllers and models and not a lot of our other objects. We're probably going to go hard on this over the next few months and just start
purging everything that we can, pretty easily, because we have a tool that tells us what we're allowed to purge. Hopefully we've learned a lesson this time and we won't have to maintain code that we don't use anymore. You can be thinking about how you can leverage tools to start removing things that are causing you pain and that you don't actually need to support anymore. Third lesson: you'll never escape the need to handle emergencies. They will always come up, and they're often very painful, both in terms of interruptions and database time.
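Before going further, the method-tracking idea from the second lesson can be sketched in plain Ruby. This is a minimal sketch under stated assumptions and not how Clio's dead code detector or Coverband is actually implemented: `UsageTracker`, `track`, and `unused` are hypothetical names, and `Report` is a made-up class. It wraps a class's public instance methods with `Module#prepend`, records each call, and reports the methods that were never hit.

```ruby
require "set"

# Hypothetical sketch: record which public instance methods of a class are called.
module UsageTracker
  @called = Set.new
  @tracked = {}

  class << self
    # Wrap every public instance method defined directly on klass so that each
    # call is recorded. A production tool would likely remove the wrapper after
    # the first hit to keep overhead low; this sketch leaves it in place.
    def track(klass)
      names = klass.public_instance_methods(false)
      @tracked[klass] = names
      called = @called
      wrapper = Module.new do
        names.each do |name|
          define_method(name) do |*args, **kwargs, &block|
            called << "#{klass.name}##{name}"
            super(*args, **kwargs, &block) # delegate to the original method
          end
        end
      end
      klass.prepend(wrapper)
    end

    # Tracked methods that have not been called so far: candidates for deletion.
    def unused(klass)
      @tracked.fetch(klass, [])
              .map { |name| "#{klass.name}##{name}" }
              .reject { |key| @called.include?(key) }
    end
  end
end

# Made-up class for the example.
class Report
  def summary
    "summary"
  end

  def legacy_export
    "csv"
  end
end

UsageTracker.track(Report)
Report.new.summary
p UsageTracker.unused(Report)
```

Here only `Report#legacy_export` is reported as unused. As in the talk, the list is only trustworthy after the tracker has run against real traffic for long enough.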
Let me walk you through the transition point when we really started to learn our lesson here. One February many years ago, everything was going smoothly, and then we started seeing periodic spikes in long requests being served by our servers. For our users, this means that requests that are normally really quick are taking several seconds, or they're being outright rejected and users are presented with a nice happy error page. At the time we were using New Relic as an application performance monitor, so we could look at which endpoints were slow, but there's a problem here too:
when the database gets slow, everything gets slow. So New Relic was no use; it was a pile of every endpoint being slow. We needed to figure out the cause and fix it. This is the kind of thing where developers aren't going home until they figure it out. We had very little signal and a lot of noise, and our users were getting more and more upset. We started digging into the database, and we started looking at the MySQL slow query log, which was just a big dump of information that was kind of overwhelming. We didn't have
a lot of people familiar with the logs at the time, so it was all hard, lots of parts all at once. We started going through it by hand, looking for anything that could be out of the ordinary: anything that was maybe a long query, or anything that touched a lot of rows. We weren't sure. We would dig through this by hand for what felt like forever, and we would eventually find one. Now we have a suspect for what's causing the spikes, but we have a query and that's about it. We had to lean on the knowledge of the developers in the group.
We had to look for the most experienced people with the code base, show them the query, and ask, does this look familiar? Can you tie this back to anywhere in the code? It took quite a lot of mental effort and tribal knowledge, but eventually we would find it and fix it. The whole process would take several hours from several developers, and it was happening frequently enough that we were getting pulled off our projects and everything was getting disrupted in a significant way. It's not fun for anyone. We had pieces of code that would make queries with assumptions that weren't true anymore.
Queries that didn't have limits, queries with lots of joins, because at the time the data set was small and that was fine. So we started looking at tools for what we could do. Basecamp has this gem, Marginalia. It attaches comments to any query that goes through Active Record, which is really nice. Setting it up is really straightforward: you just add the gem, and what could start off as a very ambiguous query starts giving you a little bit more metadata. We now have a place to start when looking for information. It doesn't give us everything.
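To illustrate the idea of query comments, here is a minimal plain-Ruby sketch. It is not Marginalia's implementation: `annotate_sql` is a hypothetical helper, and the comment format only approximates what such a tool produces. It appends request context to a SQL string as a trailing comment, which is what makes a query in the slow query log traceable back to its origin.

```ruby
# Hypothetical helper, not Marginalia's API: append request context to a SQL
# string as a trailing comment so the query can be traced from the logs.
def annotate_sql(sql, context)
  pairs = context.map do |key, value|
    # Drop asterisks so a value cannot open or close a SQL comment.
    "#{key}:#{value.to_s.delete('*')}"
  end
  pairs.empty? ? sql : "#{sql} /*#{pairs.join(',')}*/"
end

puts annotate_sql("SELECT * FROM users LIMIT 50", { controller: "users", action: "index" })
# prints: SELECT * FROM users LIMIT 50 /*controller:users,action:index*/
```

Further keys such as a user ID or request ID could be added to the context hash in the same way.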
But now we know that this query got initiated somewhere in the users controller, in the index action. But maybe you need more information, and we did at this point in time as well. As of Rails 5.2 there is this ActiveSupport::CurrentAttributes, and you can use this to add additional primitives to the SQL query if you would like. So you can add a user ID, so you know which user is causing this, to understand if there are trends. You've got the request ID, so you could go dig into the logs yourself and look up if there's any additional
information, anything that you find might be useful. Setting something up like this is really straightforward; the instructions are in both Marginalia and the docs for setting up ActiveSupport::CurrentAttributes. You inherit from it and add a couple of attributes, whatever you feel like you'd want. You can set it in your controllers or jobs or wherever you want to run it; here it's just a before action in a controller. Marginalia has instructions on how to extend it by adding a couple of methods to the Marginalia::Comment module, and then you tell Marginalia to use these,
and once again our query now has more metadata in it. The pain of working from the query back to the source code was a little bit less, we didn't need as much tribal knowledge to figure things out, and we had a tool to point us to the place where we should start. But maybe we solved the wrong problem here. Rather than trying to go from the logs back to the code, what if we could just have the code tell us what's wrong in the first place? Rails is recording some of this information for us; we just need to listen for it. ActiveSupport::Notifications is a nice library that provides
information about various events that are happening in your system. In particular, it has the sql.active_record event, which is instrumented every time a query is executed through Active Record. With this you could ask: is the duration greater than some threshold? And if so, do something with it. There are lots of things that ActiveSupport::Notifications tracks that you can subscribe to and add to your own application. And if there are things that it doesn't track, these are things that you can
add and instrument yourself. Things like, how long does a transaction take? Maybe that's something you want to track. How much memory does a controller action take? Maybe that's something you want to track. We took a very straightforward approach: we decided we would turn that into an exception with a very clear message and record it in our bug tracking service. What started off being this nebulous problem of someone digging through logs by hand turned into this nice, big, detailed report where we could see exactly which query, along with the exact line this
came from. We could see trends: when did this start, when was the last instance, how frequently is this happening? Along with releases, it would give us a lot of information, but it would also give us proof that when we fix these, they just disappear, and we could actually say it's gone now. This generated a lot of proactive work for us, but it was all work that was going to save us from having to deal with emergencies in the future. We can go to the business and say, remember that February when the users were really upset, and it was all caused by
these long queries? These queries aren't as bad, but they're getting worse. We need time to fix them. We can put that in terms the business understands, and the business says, okay, yeah, let's tackle a couple of these every week. We can get into a situation where we don't need to fight fires as much, and when fires do come up, we can look for these exceptions and have them tell us what's wrong. What started off as a super painful process for us is now much less painful. It doesn't magically solve our performance problems, but it does
make it so that when performance problems start cropping up, we have much more signal. People can now focus on actually fixing the problem rather than just finding the problem. We've invested in ourselves so that the bad technical debt is just easier to isolate and tackle. You can think about when emergency situations crop up in your work: how do you deal with them? What are the things that you spend a lot of time trying to deal with? How could you add tools to make that easier? The fourth lesson I want to talk about is keeping the bad patterns out. If you know a
bad pattern, can you automate keeping it out of your code base entirely, so you don't have to rely on people to review it? Let's look at an example. Sometimes you might need to write to a temporary file. Here's some simple code that writes the ID and name of all of the contacts in your database to a temporary file, and then closes and removes the file. There's a problem here, and it's code that Clio doesn't want in its code base. If at any point in time something in the middle of the code breaks, maybe name is actually a method and not an attribute
and it throws an exception at some point, we'll just leave the temp file around. What you really need is to wrap it, to say that if we start this block we need to ensure that we always delete the temp file afterwards. The documentation for it says it's not strictly necessary to delete temporary files, but if you don't, there can be problems. And in practice we noticed that there are times when temp files don't get cleaned up at all. It's somewhat rare, but we have a lot of things that work with temp files. We can just fill up a server, and maybe at some point that server runs
out of space, and that server is unable to handle things, and you're in a partial outage, all because of a little bit of technical debt. So how do we stop this bad technical debt from entering your code base? Here is a bit of a fabricated pull request. If you saw this, what would happen? Maybe you have an experienced developer make a comment: hey, this is not great, here's a little motivation for why it might be bad, here's a blog post to help you learn, or maybe here's the documentation, and here's what you could do better. The person who submitted the code will read those, understand them, and fix it.
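The temp file fix described above can be sketched as follows. This is an illustrative version, not the code from the slides: `Contact` and `export_contacts` are hypothetical names standing in for the real models. The point is the `ensure` clause, which closes and removes the file even when something in the middle raises.

```ruby
require "tempfile"

# Hypothetical contact records standing in for rows loaded from the database.
Contact = Struct.new(:id, :name)

def export_contacts(contacts)
  file = Tempfile.new("contacts")
  begin
    contacts.each { |c| file.puts("#{c.id},#{c.name}") }
    file.flush
    File.read(file.path)
  ensure
    # Runs whether or not the block above raised, so the file never lingers.
    file.close
    file.unlink
  end
end

puts export_contacts([Contact.new(1, "Ada"), Contact.new(2, "Grace")])
```

Ruby's standard library also offers `Tempfile.create` with a block, which removes the file when the block exits; either form avoids leaving files behind after an exception.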
Everyone's happy. But what would happen if the developer didn't catch that and we deployed it to production? Would it be the end of the world? You'd probably not even notice any problems. But things like this aren't a problem until they are. Maybe next year is when the server runs out of space and you have a partial outage. To keep bad patterns out of the code base, there's RuboCop. I'm sure you've heard of this one: a static code analysis gem with various rules. You can add this enforcement as part of your CI process.
You can also add this as part of your pre-commit hooks, looking only at the files that have changed, for much faster feedback. For very simple rules, it can even handle autocorrecting for you. Some of the things in RuboCop are more style-based, where you say, hey, I prefer single quotes as opposed to double quotes. But RuboCop can also enforce more complicated things, like: I don't ever want to rescue from Exception. If you rescue from Exception, it will be fine most of the time, except for when it's not, and those are the times that it's going to bite you. You can
use RuboCop and say, this is a pattern we just never want in our code base, and it can never get in because we have these checks in place. If there are more complicated things, you can write your own cops, supply them back to the community, and help prevent other things. Now, sometimes you can't enforce the rule that hard. Shopify had a neat tactic they refer to as shitlist-driven development. I love the name. It's a way to whitelist some places in your application, saying that this is allowed to do
a certain behavior, and blacklist everything else. Once you've whitelisted a few things, you can continue working to try to fix them without disrupting the development team. So, pulled directly from the blog post, here's an example of what we're saying: we have a job that does some things that we don't like anymore. There are three classes that are allowed to do this, and nothing else. If anything else tries to do this, it will just error, again with a very clear message. But we can work to remove class A, class B, and class C from doing this
behavior when we're ready, and then we can remove this method. We know that it's not going to get any worse; we've stopped the bleeding. Also pulled from the blog post is something like this, so you can have it as part of your test suite. You could say, I only want these three classes to inherit from this model. Now these are the only classes that can inherit from it; for anything else, this will fail. This is great. We can start preventing these patterns from getting into our code base early and give clear messages. We talked about the temp file thing. At Clio we have this as part of our CI: if you add Tempfile.new and violate some of those conventions, we give you this. It tells you why, gives you some motivation for why it might be bad, tells you what it thinks you're doing and what you could potentially do better. If there was a new developer who just copy-pasted some code from Stack Overflow, they would get immediate feedback on why it's not great, what they could do, and how they can fix it, and it doesn't require any manual effort from a developer. There are lots of ways to do this. At Clio, we have a bunch of shitlists and a bunch of RuboCop rules, and it's a useful tactic for us to just
keep all of these bad patterns out. As I showed, we have the Tempfile.new one, but we also do things like global variables. Don't use global variables; it's now a part of the RuboCop rules, and you just can't add them. Randomization in specs is another: there was a talk earlier about fixing flaky tests, and we apply this while we clean them up and move everything over to better ones. Whenever you have a bad pattern that has caused you problems, you should think about what you could do to prevent this pattern from ever being reintroduced into the code base again, remove the manual effort, and make it so that a tool can do it
for you. It's important to always be thinking about how we can remove technical debt, and to acknowledge when we may be intentionally introducing it. The more we focus on utilizing and building tools to make it easier, the more we can focus on delivering value to the businesses we work for. If we focus on building better tools and tactics and sharing them with the community, we all become better. What ideas do you have that you could give back to the community? Are there ways that you know of to eliminate technical debt, so that now we all don't need to