About the talk
RailsConf 2019 - Profiling and Benchmarking 101 by Nate Berkopec
You know your Rails application is slow. Customers are complaining, and your ops team is provisioning even more servers, but it just isn't helping. What now? In order to fix a performance problem, you have to be able to measure it and diagnose where it is. Profiling and benchmarking are our two tools for doing just that. In this talk, you'll get an introduction to the tools available for benchmarking Ruby code, and how to use a profiler to figure out where your program spends its time.
Okay, good afternoon everybody. My name is Nate Berkopec, and this presentation will be an introduction to benchmarking and profiling in the Ruby language. If you're not familiar with me or my work, I am a Ruby and Rails performance consultant. I've been doing that for the last four years or so. I wrote a book called The Complete Guide to Rails Performance, and I consult with companies large and small to make their Ruby applications faster. I'm also doing a nationwide workshop tour this summer, so if you like what you've seen today, you can get the full version at one of my workshops this summer. The workshop is focused on measurement and diagnosis of performance problems, which is what we're going to get into just a little bit today, and it takes place on Friday here at the Hyatt Regency. I have just a few more tickets available for that one, so please talk to me if you're interested in staying an extra day and attending that workshop.

Okay. So today my objective is to give you enough of a road map on the topics of benchmarking and profiling for you to go home, open up a Rails application, and get a good idea of where to start and what to Google for when you get stuck. I can't cover the complete basics of benchmarking and profiling, particularly the specifics of how to use each individual tool. I think you're all smart enough and good enough programmers to go home and read a README to get the exact specifics of how to use a specific tool. But today we're going to cover at least what these tools are, what they're good for, and what they can tell you about your application. And although this is RailsConf, nothing I'm going to be presenting here is specific to Rails.

So let me begin with a story about one of the most famous computing projects of all time, the Apollo guidance computer, and how a performance bug almost caused an abort of the very first moon landing. The AGC is famous for having extremely low computing resources: one processor running at about 1 megahertz, 36 kilobytes of fixed memory, and about 2 kilobytes of RAM. However, it was one of the first computers ever to implement a concurrency architecture that was not based on time slicing. Time slicing is a technique where you divide a second into smaller increments, or slices, and then allocate time evenly between those slices. But that wasn't going to work for a real-time application like the moon landing. Hal Laning, a software engineer at MIT, devised a new method which he called the Executive. It was one of the first ever implementations of cooperative concurrency, and its design became extremely important during the landing of Apollo 11, as dramatized in the movie First Man, which I'm about to show you. Pay attention
here to the error codes that the astronauts call out during the landing.

[Clip from First Man: during the descent, the astronauts call out a program alarm.]

So two astronauts are hurtling down toward the surface of the Moon, and they get an error code that they've never seen before. Five times in two minutes during the final descent of Apollo 11, a software error occurred, and five times the Apollo guidance computer had to be soft restarted. These errors were of the 1201 and 1202 error code classes. When Neil Armstrong radioed down to Houston and asked what to do, Mission Control radioed back that it was a go. What doesn't show up in this transcript is that a 24-year-old computer programmer named Jack Garman, at the back of Mission Control, told the guidance controller that the error code was a go, and it was fine as long as it didn't occur continuously. So there was only one person at the back of Mission Control who knew: no, it's fine, don't abort the moon landing. Although they didn't have time to check it, flight director Gene Kranz actually had a sheet of paper taped to his desk that listed the various alarm codes that the AGC might raise during the descent. Among these were the 1201 and 1202 codes. The sheet said that these error codes were an overflow of the available memory for active work, and that they would cause an automatic soft reset of the AGC. The 1201 and 1202 error codes noted that the duty cycle might be up to the point of missing some functions; in essence, the AGC was being overloaded. The BAILOUT routine, the subroutine that was called automatically when these 1201 and 1202 error codes were raised, was called when the AGC ran out of core sets. The AGC had eight of these core sets; they were just areas where you could store a small amount of data related to a single job or task. So in effect the AGC had a job queue that was eight jobs deep. If the AGC became overloaded and had too much work to do, the number of core sets in use would keep increasing until something tried to enqueue a ninth job. When that happened, BAILOUT was called and the AGC would reset itself, clearing the core sets. 1201 and 1202 errors were raised when the AGC was running Program 63, the guidance mode used to slow down the lunar module during descent. That was the most taxing part of the entire Apollo mission for the AGC's measly hardware. This is a graph of a simulation of the Apollo guidance computer's CPU utilization during descent. The top line here is CPU utilization; back then the AGC engineers called that the duty cycle. During the final phase of the descent, the AGC was expected to see about 90%
of its available CPU cycles used. The fact that the 1201 and 1202 errors were being raised meant that the duty cycle had exceeded 100%. During the lunar descent, the astronauts were on a hair trigger to abort the mission in case of a safety issue. The astronauts, specifically the LM pilot Buzz Aldrin, had lobbied the engineers that during the descent the rendezvous radar would be turned on. The rendezvous radar tracked the command module, which was orbiting above them during the descent, in case they had to abort. If there was an abort, they wanted to know where the command module was, so they could reposition the lunar module to rendezvous with the command module and get back to Earth. So keeping the rendezvous radar on was sort of increasing the chances of a successful abort, should it be needed. However, there was a bug in the rendezvous radar: as the lunar module tilted forward, the rendezvous radar pointed out into space rather than pointing at the command module. This change in angle, combined with an extremely obscure bug in the power supply to the rendezvous radar, caused an overload of tasks and data to be enqueued into the AGC. Later analysis would show that the rendezvous radar was taking up 13% of the available duty cycle with these spurious jobs it was creating, which was 3% more than they had available. Now, I'm sure everyone here has never had some super wacky shit take down their application in production. The story of the 1202 program alarm reminds us that no amount of performance work outside of the production context will ever catch every single performance bug. One of the
disadvantages of Hal Laning's Executive design was that it was non-deterministic. While the Apollo engineers ran a lot of simulations, they could never perfectly replicate the conditions of the moon landing. One of the main objections I get when I teach people about benchmarking and profiling is that they don't like that the performance bug has to happen first in production before we can replicate it on our machines in development. They'd rather do load testing and try to find these things before they happen. But as the Apollo story shows us: load testing is not reality, simulation is not reality. Only reality is reality.

Another thing I get a lot from people when I start talking about benchmarking and profiling is that people would rather learn the solutions themselves. They want me to give them the tips and the tricks and the one weird solution that will fix all of their real performance bugs. This is a list of the 321 optimization passes the GCC compiler makes on every line of C code. This list was provided by Vladimir Makarov of Red Hat, who is working on a JIT for Ruby. Each of these GCC compiler passes makes a very small tweak: a small check, optimize one thing here, one thing there. It does this 321 times. But we are not compilers. I could teach you 320 things to look at on every single line of Ruby code, but you wouldn't remember 320. You wouldn't remember 32. You probably wouldn't remember 3. And I cannot just stand up here and teach you all these things so you can go home and run each of these mental models on every single line of code that you're ever going to write, because the other problem is that it's not really 320 things, it's more like three thousand things that could possibly go wrong on every line of code, and you're probably not going to remember even three of them. What I'm trying to get at here is that all performance work without measurement is premature. This is what we're talking about when we talk about premature optimization: if you haven't measured, if you haven't proven the problem, working on it is premature. So that's why measurement, benchmarking, and profiling are the foundational performance skills. That's where I start everybody when I'm teaching performance work.
I'd also say that the number one thing that probably stops people from working more on performance is that when they start working on it, they say, "I sped up this part of the code by 10x," and then they deploy it and it makes no difference to the application, because they weren't paying attention to what was actually happening in production. So the method I'm going to teach you here today is: measure twice, PR once. If I can show you how to measure, and show you how to replicate problems in development, it's far more likely that the work you do on improving performance is actually going to make a difference when you deploy it.

Okay, so let's do some definitions. A benchmark is a synthetic measurement of the resources or time consumed by a defined procedure. Synthetic is obviously a key word here: benchmarks are usually not intended to replicate an exact production process; they're usually a lot more granular than that. And we can measure not just time consumed, so iterations per second, or how long it took to do a thing.
We can also measure the resources a procedure consumes in benchmarks. So we can say, hey, this process allocated this many objects, it took up this much memory; that's also a benchmark. And it's over a defined procedure: one line of code, one controller action, and we're going to run it multiple times in our benchmark.

Profiles are different. A profile is a step-by-step accounting of the relative consumption of resources by a procedure's many subroutines. Step-by-step accounting: rather than getting one number at the end, like iterations per second or memory consumed, we're going to get a lot of numbers. It's going to be quite a lot of data that says, hey, we spent 10% of our time here, 5% of our time here, 1% here, and depending on the tool we can get that at an extremely granular level, down to the line number. And I use the phrase relative consumption there because the way most profilers work is that they tell you that you spent 10% of your time here, but it's 10% relative to one execution of the entire operation. So it's not going to say we spent a hundred milliseconds of time here, like a benchmark would when it gives you an absolute time; profile numbers tend to be relative. And we can also do resources here: we can profile memory, we can profile time. We're usually going to think about profiles in terms of subroutines, which in a Ruby context means methods. So you can look at how much time you spent in certain methods and certain modules, and it gives you a breakdown of where time was spent.

So the overall process I want to teach you today is this. We're going to start with reading production metrics. If you're not looking at metrics in production, if you're not looking at what actually happens to users and what code is actually slow for them, then you're optimizing prematurely. You're working on code that, if you make it 10 times faster, maybe only gets called one in a million times, so making that part of the app 10 times faster was probably just a waste of your time. Everyone here should be using one of New Relic, Skylight, or Scout in production. Those are the tools that give you these production metrics to make these informed decisions. Once you've got the data from those tools, you say, hey, our search action is too slow. Then you can move to the next step and profile that locally to figure out what part of it is slow. Search takes 1,500 milliseconds? Okay, well, what part of that is slow? It doesn't help you to just know that it's slow; you need to know what line is causing that slowness. That's where profiling comes in. Once we identify from the profile what parts of the code are slow, then we're going to take those slow parts and make a benchmark for them. So if we identify in our slow search action that the part that's actually slow is not the search, it's the rendering of the results, then one benchmark we might write is: how long does it take to render one line of the results? That'll be our benchmark. Once we have the benchmark, it's extremely easy and fast to iterate on the solution, because we have an actual number. We can say, hey, this rendering of the result ran at a thousand iterations per second on master when we started, and once we've applied my amazing solution, now it's two million iterations per second. Great. Then we know it's faster, and then we can deploy that change.

So profiling answers what is slow. These production profiling tools normally can't give you that much detail. Profiling adds overhead; it has to, by necessity, because you're no longer doing just the work. You're doing the work and recording what happened at the same time. That's more work than you were doing before, therefore profiling by necessity will add work and make your app slower.
So these production profilers, Skylight, New Relic, Scout, all have to make a trade-off, and they'll tend to give you less data so that they don't impact your production users. But in development, we don't care about that. We can add a lot more overhead to get more accurate data. So when we're profiling in development, we're going to answer the question: what is slow? To answer that question accurately, we have to have an accurate profiling environment, and an accurate profiling environment is one that looks and works in as production-like a way as possible.

One of the biggest parts of this is the data. Probably the majority, I would say, of Rails performance problems are related to Active Record, which means they are related to data, and that means the data in the database has to look kind of like production. When we call User.all, we need to get the same number of rows back as we would in production, to whatever sane extent we can. I know if you work at Shopify that's not a sane thing to do in development, but it needs to be more than 10 rows when you call User.all, because if that call happens in production for real, it's going to touch far more than 10 rows, and that would be very bad. So how do you get production-like data? You can either download production dumps and just throw them into your development database, if you're cool with that. If you're not, I've seen people build an automated process for downloading production dumps and then eliminating the PII. That could be just dropping the users table; that's what RubyGems.org does, they just drop the sensitive tables from the dump. Or it could be going through and changing every email to some random string, or whatever. Another approach is to have an actually good database seed file. That's the most difficult approach, because it has to be good and create millions of rows, and that takes a long time, so it's probably the least realistic option. But if you work in the kind of environment where you have to do that and can't use real data, then that's what you're stuck with.

Beyond the data, the code has to work the same as production. There are some settings in particular that cause development mode to work in a non-production-like way; I actually talked with someone yesterday about this. Personally, I just go into config/environments/production.rb and hack it until it runs in development, so that I'm working with production-like code. But then you can't check that in, so it's not really a sustainable solution for teams. What I've seen people do that actually works is to go into development.rb and hide these changes behind an env variable. So it'll say: if PROFILE is set, then change this setting to that, and then, still in development mode, they can get some production-like code behavior.
The things that make profiles in development look wrong, look non-production-like, are code reloading and compilation. So Sprockets: you're going to have to turn asset debugging off and precompile your assets. And Active Record's pending-migration check: if you don't turn that off, you're going to see this thing at the beginning of every profile that's checking whether there's been a migration or not. Turning all those things off makes the code behave in a slightly more production-like way, gets all the cruft out of our profiles, and gives us a more accurate result that looks more like production.

There are two modes for measuring time in almost all profilers: CPU time and wall time. CPU time counts time based on CPU cycles, which means that things like I/O and sleeping don't show up in those profiles, because no time has passed relative to how many CPU cycles have occurred for the process. Wall time is based on the clock on the wall, stopwatch time. So everything shows up: waiting on I/O, sleeping, all of that counts and shows up in a wall time profile. If we have a method that sleeps for 4 seconds and we use a wall time profiler, it will say this method took 4 seconds to execute. This is example output from ruby-prof. If we ran this in CPU time mode, it would say this method took 0 seconds to execute, because it spent 0 time on the CPU. CPU time profiling is useful when you want to take I/O out of the equation. Sometimes you don't want to do that; sometimes it is relevant that calling user.some_scope takes three hundred milliseconds, and you want to keep that in the profile. Sometimes it isn't: if you're profiling a database driver, for example, and you don't care how long it actually takes for the database to come back with a response, use a CPU time profiler.

There are also two designs for profilers, and this is true of any profiler, not just in Ruby: there are statistical and tracing profilers. Statistical profilers sample a percentage of the available stack frames, say every millisecond or 10 milliseconds.
They interrupt your process, record the stack frame, and then allow the process to continue; then all of those stack frames get aggregated into the profile. A tracing profiler, on the other hand, hooks into the language runtime, and every time you call a method it says, hey, you called String#+, increments that counter by one, and records the stack when it happened. So tracing profilers actually record every single thing that happens. With a tracing profiler you'll see that this method was called 3,092 times. With a statistical profiler, you can't get that level of detail, because it only sampled, say, 1% of the available stack frames, and it's inferring the rest of them statistically.

We have two main profilers in the Ruby world that we use in development: stackprof and ruby-prof. Stackprof is statistical; ruby-prof is tracing. I like them both for different reasons, and they both have different styles of output. I would encourage you to give both a shot, because you'll find they're each useful for different problems.

This is what stackprof output looks like when you use it with rack-mini-profiler, which is another tool I'm going to talk about a little bit more in a minute. This is a flame graph. Each one of these little bars is one stack recording, and each horizontal line is one layer of the stack. When you're looking at a flame graph, what you want to pay attention to is the long flat sections, because a long flat section tells you that one function took a long time to execute. That's interesting; you might want to dig into it a bit more. If you zoom in, you'll see the actual line of code for each layer of the stack, so you can zoom in and say, all right, we spent a lot of time executing this function, let's go check it out. The other thing I like to look for in flame graphs is repeating shapes. We've got a repeating shape here: four or five times it does this sort of up-and-down through similar parts of the code. That's interesting, because it means we're in a loop that took a lot of time, and if we can make that loop faster, we can speed up this part of the code.

This is what the output of ruby-prof looks like when you use its call stack visualizer. It's one of my favorite visualizers for ruby-prof; I use it all the time. Each one of these lines is one layer of the stack. There are some numbers here that you're not going to be able to read from the back, but basically each line tells you how much time was spent in this stack frame and its children as a percentage of the total profile, and then what percentage was spent relative to the parent. What we're looking for in a call stack profile is this: if one line says we spent 98% of our time in this stack or its children, and the next line down says we spent 90% of our time there, well, we lost 8% somewhere; we spent 8% of our time in this frame itself. That's interesting; I would dig in more there, and maybe benchmark that part of the app.

Try different views, try different formats. One format, one profiler is not perfect for all problems; you have to try different ones. I've noticed some people just work better with different profilers. I guess some people are more visual and want to see the graph, so just give them all a shot. We can also profile memory. Richard talked a little bit more about this in his talk today, and I'd go back and watch that if you haven't seen it. But just as a reminder, we can profile not just time but also resources. The memory_profiler gem is probably the best tool we have right now to do that, and the best way to interact with memory_profiler and stackprof right now is the rack-mini-profiler gem. It's the number one tool that I think everyone
should be using on their Rails applications. It's a little badge in the upper left that gives you an easy-to-use interface to the memory_profiler and stackprof gems, so you can look at any controller action and profile it as quickly as possible.

All right, now let's talk about benchmarking. There are three kinds of benchmarks: micro, macro, and app. They're different levels of scope. I would say a micro benchmark is one that runs just one to a hundred lines of Ruby code before it iterates again, and I don't mean one hundred lines of Ruby code in the benchmark that you wrote, but one line in the benchmark that calls out to up to a hundred lines of code before it iterates again. Micro benchmarks iterate, you know, one million times per second or more. They're very dangerous, because they can mislead you about how big of an impact a change is going to have. If you have one million iterations per second on a piece of code and you can turn that into two million, that's a 2x improvement. But if that particular line of code is only executed once during a controller action, that's going to have almost no impact at all on that controller action.

A macro benchmark, I would say, is anywhere between a thousand and a million iterations per second. It probably runs a couple thousand lines of Ruby code before iterating. Macro benchmarks are probably the kind I write the most; it might be one call to a service object, or one method on a User model. They're bigger than micro benchmarks, but they're small enough that they're not an entire controller action. An entire controller action I would call an app
benchmark. Those maybe iterate only a couple hundred times per second. They're useful, but at some point you just sort of have too much in the benchmark, and making changes to the code may not result in a very big change to the benchmark itself, because so many other lines of code are running.

So this is benchmark-ips. It's my preferred benchmarking tool, written by Evan Phoenix. This line right here is the default config options for time and warm-up. benchmark-ips will run your code once as a warm-up and then discard the result. It does that because it needs to fill up caches, load your code, and all that. The first run of any piece of code will always be slower than the second run, and so on and so forth, until it eventually reaches a steady state. Getting to that steady state is what the warm-up is all about, and benchmark-ips does that for you. After that, we run the benchmark itself, by default for 5 seconds. So each of these blocks here is going to get benchmarked against the others; this is an actual benchmark from a gem I used to maintain. Each one gets run for 5 seconds and the result is recorded. The output looks a little bit like this: this is the warm-up phase right here, then this is the actual benchmark itself, and then a comparison for you at the end, where it does a really basic statistical test to tell you whether or not the result is probably significant.

The benchmark-ipsa gem adds this little bit on top, which measures the allocations of one run of the benchmark. It's super cool. As Richard's talk covered this morning, object allocations tend to correlate directly with time consumed, so I like to add this in as a little hint of, hey, maybe the reason this benchmark here is slow is because it's allocating so many more objects than this one. So we can benchmark not just time but also resources; again, that gem is called benchmark-ipsa.

The warm-up and bench time parameters for benchmark-ips are usually fine for 95% of the benchmarks you're going to write. If you want to change the warm-up time, you're going to change it because it needs to be longer to keep the results stable. If the warm-up time isn't long enough, the initial thousand or million runs of the code are going to be slower than the next thousand or million executions. So basically you want to end the warm-up when the iteration speed is stable. Two seconds is usually enough for CRuby, but if you're on JRuby, if you're benchmarking TruffleRuby, or if you're using Ruby 2.6 in JIT mode, those runtimes need more time to get to a steady state because of what they're doing under the hood. So you might need a warm-up time of 10 seconds; TruffleRuby's warm-up times can be really long, like 30 seconds. That's why you would change that parameter. Bench time generally should just be long enough to keep the variance of the result low. You might have seen in the benchmark results here that we get a little plus-or-minus margin of error. If that number is high, like it is here, where it's plus or minus 20%, that's a problem; I like to see it at less than 5%.
But if it's really high, sometimes all you need is to run the benchmark for a little bit longer, and then you can get that statistical noise out, especially if it's a fairly low iterations-per-second benchmark like this dispatch one is. So I would sometimes increase the bench time just to get rid of some of that variance, especially when running app-level benchmarks that only iterate 10 times per second.

Okay. My number one benchmarking rule: beware the loop. You have to be aware of what is being run in the benchmark loop, and be sure that what's being run in there is actually what you want to measure, because if you have extra overhead being run in the benchmark, the overhead is going to obscure the speed difference gained by the changes you're making. So here's a mistake I see in most people's benchmarks when I'm teaching this for the first time: Active Record inside of the benchmark loop. The difference between this line and this line is that we're doing an Active Record lookup every single time we run that piece of code. Now, Active Record's query cache is going to catch this, so you're not going to go to the database every time, but there's still a lot more code being run here than here. And if you don't actually care about how fast Active Record looks things up, and what you really came here to benchmark is the contents of the do block, then get that stuff out of the benchmark loop. So pay very careful attention to what overhead is being done inside of each iteration, and get as much of it out as you can, so you're only measuring the stuff that you actually care about.

I think most apps should have a benchmarks folder. If you do get into the habit of writing benchmarks for your
performance-related changes, which I hope that you do, it'd be a shame if you just threw those away at the end of every pull request. They can be very valuable tools for measuring changes over time. Just check them into a benchmarks folder and let them live there. Some people ask me about, you know, some fancy CI setup where their benchmarks run every time they push. That's cool if you can figure that out, but it's also cool if somebody can just say, hey, there's a folder for this, can you run that and post the results on this feature branch, just to check what happens. What happens in an app tends to be that the same parts get touched over and over for performance reasons, so it really doesn't take that many benchmarks to cover the important parts of an app. Just check them in, know that they're there, and try to use them in your PRs in the future. Just checking them in can be a really big increase in organizational knowledge for the team.

Okay. So this is the process that I've laid out today: read production metrics; turn that into a profile in development to figure out what part of the action is slow; once you figure out the slow part, create a benchmark for it; then iterate on that benchmark, trying to change some code and see if you can make the benchmark faster; once you have deployed, start the process over again. This is just the scientific method: observe, create an experiment, run the experiment, and observe again. We can also short-circuit this method at any point once we've figured it out. I've looked at enough Rails applications that frequently all I have to do is look at the production metrics to know what's wrong and what to fix. But there are also times when I look at an app and I have no idea why an action is slow, and as long as I don't know, I'm going to keep going to the next step. So you can bail out at any point once you know. When you're on a big enough team, or you're making a PR to a big enough project, or a project that's not your own, they're going to want to see the numbers anyway. If you make a performance PR to Rails, they're just going to tell you: please come back with a benchmark. And that's fine; that's the only way Rails could possibly work. They can't merge changes on the strength of you just telling them some code is faster. So you might have to create a benchmark to work with your team. But on small teams, or on your own projects, just go as far as you need to figure out what the solution is.

Okay, so that's what we talked about today. Hopefully I've convinced you why measurement is important. I've talked about the differences between benchmarking and profiling, what tools are available, and given a brief introduction as to how to use them: a little bit on profiling, a little bit on benchmarking, the terms used, and how to do those things.
After Apollo 11 landed, a frantic search began at Mission Control to figure out what had caused those 1201 and 1202 program alarms. Luckily, Apollo was built with a kind of production profiler: the Apollo guidance computer was constantly sending down data about its own state to Mission Control. When engineers looked at that data, they noticed that one bit was flipped to a weird value, and it was the value of whether the rendezvous radar was on or not. It showed that the astronauts had been flipping the rendezvous radar to the position where it tracked the CSM. The astronauts had changed the procedure and basically not told anybody working on the software, and the Apollo landing was never actually tested with a real working rendezvous radar. When the astronauts were testing this procedure, they would flip the rendezvous radar switch, but it was a dummy switch. The lesson is not that we can and should test every possible production state. The lesson here is that it is impossible, even with the massive manpower and cash resources of the Apollo program, and with literally only 36 kilobytes of code, to figure out every possible permutation of state that could impact the performance of your software. Instead, we can build software that fails gracefully, and then try to recreate the pieces in development, using profilers and benchmarks to study and replicate what happened in production. That has been my talk. Thank you very much.