Duration 35:52

RailsConf 2019 - Cache is King by Molly Struve

Molly Struve
Lead Site Reliability Engineer at DEV Community
RailsConf 2019
April 30, 2019, Minneapolis, USA

About speaker

Molly Struve
Lead Site Reliability Engineer at DEV Community

I am a Site Reliability Engineer who thrives on making performance optimizations. I have extensive knowledge of Elasticsearch and also work with Redis and MySQL daily. My primary coding language is Ruby, and I enjoy finding ways to use it to further speed up an application. I am also passionate about building and monitoring infrastructure in a way that is easy to use and easy to scale.


About the talk

RailsConf 2019 - Cache is King by Molly Struve


Sometimes your fastest queries can cause the most problems. I will take you beyond the slow query optimization and instead zero in on the performance impacts surrounding the quantity of your datastore hits. Using real world examples dealing with datastores such as Elasticsearch, MySQL, and Redis, I will demonstrate how many fast queries can wreak just as much havoc as a few big slow ones. With each example I will make use of the simple tools available in Ruby and Rails to decrease and eliminate the need for these fast and seemingly innocuous datastore hits.

Welcome, everyone. Cache is King. My name is Molly, and I am the Lead Site Reliability Engineer at Kenna Security. If you came to this talk hoping to see some sexy performance graphs such as this one, I'm happy to say you will not be disappointed. If you're the kind of person who enjoys an entertaining gif or two in your technical talks, then you're in the right place. If you came to this talk to find out what the heck a Site Reliability Engineer does, also bingo. And finally, if you want to learn how you can use Ruby and Rails to kick your performance up to the next level, this is the talk for you.

As I mentioned, I'm a Site Reliability Engineer. Most people, when they hear Site Reliability Engineer, don't necessarily think of Ruby or Rails. Instead, they think of someone who works with MySQL, Redis, AWS, Elasticsearch, Postgres, or some other technology. Kenna's SRE team is just over a year old, and when it was first formed, you can bet we did exactly what you'd expect: we went and found those long-running, horrible MySQL queries and we optimized them.

We did this by adding indexes where they were missing, using select statements to avoid N+1 queries, and processing things in small batches to ensure we were never pulling too many records out of the database at a time. Once MySQL was sorted, we also had Elasticsearch searches that were constantly timing out, so we rewrote them. We even overhauled how our background processing framework, Resque, was talking to Redis. All of these changes led to big improvements in performance and stability. But even with all of these things cleaned up, we were still seeing high loads across all of our datastores, and that's when we realized something else was going on.

Now, rather than tell you what was happening, I want to demonstrate it, so for this I need a volunteer. Come on up, it's super simple. Awesome, thank you. What's your name? Oh, I already forgot it. What's your name? I need it one more time. What's your name? I could keep asking for your name a million times, but I don't have to keep asking if I write it down. So I'll give you this pen and a piece of paper: just write your name down there, your first name is fine. Terrible handwriting? That's fine, I can read it.

Okay. Now that I have Shane's name written on this piece of paper, I no longer have to keep asking for it. I can simply read it off the paper. This is exactly what it's like for your Rails application. Imagine Shane is your datastore. Every time your Rails application has to ask your datastore for information, it takes your Rails application a long time to do it. If instead your Rails application makes use of a local cache, which is essentially what this piece of paper is doing, it can get the information it needs a whole lot faster, and it can save your datastore a whole lot of headache.
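
A toy sketch of that idea, assuming a hypothetical Volunteer model and ignoring Rails' per-request query cache (this is illustrative, not code from the talk):

```ruby
# Asking the datastore every time, like asking Shane for his name repeatedly:
3.times { puts Volunteer.find(1).name }   # three database hits

# Writing it down once -- a local cache:
name = Volunteer.find(1).name             # one database hit
3.times { puts name }                     # the rest is read from local memory
```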

That moment at Kenna, when we realized it was the quantity of our datastore hits that was wreaking havoc on our datastores, was a big aha moment for us. We really started trying to figure out all the ways we could decrease the number of datastore hits we were making. Before I go through all the awesome ways we used Ruby and Rails to do this, I first want to give you a little bit of background so you have some context for the stories I'm going to share.

Kenna Security helps Fortune 500 companies manage their cybersecurity risk. The average company has 60,000 assets; an asset is basically anything with an IP address. In addition, the average company has 24 million vulnerabilities; a vulnerability is basically any way you can hack an asset. Now, with all of this data, it can be extremely difficult for companies to know what they need to focus on and fix first, and that's where Kenna comes in. At Kenna we take all of this data and we run it through our proprietary algorithms, and those algorithms tell our clients which vulnerabilities pose the biggest risk to their infrastructure, so they know what they need to focus on and fix first.

When we initially get all of this asset and vulnerability data, the first thing we do is put it into MySQL; MySQL is our source of truth. From there, we index it into Elasticsearch. Elasticsearch is what allows our clients to really slice and dice their data any way they need to. In order to get assets and vulnerabilities into Elasticsearch, we have to serialize them. And that is what I want to cover in my first story: serialization. In particular, I want to focus on the serialization of vulnerabilities, because that is what we do the most of.

When we initially started serializing vulnerabilities for Elasticsearch, we were using ActiveModel serializers to do it. ActiveModel serializers hook right into your ActiveRecord objects, so all you have to do is define the fields you want to serialize and they take care of the rest. It's super simple, which is why it was naturally our first choice for a solution. However, it became a less great solution when we started serializing over 200 million vulnerabilities a day.
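
For context, a minimal ActiveModel Serializers class looks roughly like this (the field names are made up for illustration):

```ruby
class VulnerabilitySerializer < ActiveModel::Serializer
  # Just declare the fields; the serializer pulls them off the ActiveRecord
  # object, which is what makes it so convenient -- and what hides the
  # per-record database lookups behind associations like this one.
  attributes :id, :title, :severity
  has_many :custom_fields
end
```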

As the number of vulnerabilities we were serializing increased, the rate at which we could serialize them dropped dramatically, and our database began to max out on CPU. The caption for this screenshot on Slack was "11 hours and counting." Our database was literally on fire all the time. Now, some people might look at this graph and their first inclination would be to say, why not just beef up your hardware? Unfortunately, at this point we were already running on the largest RDS instance AWS had to offer, so beefing up our hardware was not an option.

We looked at this graph and we thought, okay, there has just got to be some horrible, long-running MySQL query in there that we missed. So off we went, hunting for that elusive, horrible MySQL query. Much like Liam Neeson in Taken, we were bound and determined to find the root cause of our problem. But we never found those long-running, horrible MySQL queries, because they didn't exist. Instead, what we found were a lot of fast, millisecond queries that were happening over and over again. All of these queries were lightning fast, but we were making so many of them at a time that our database was being overloaded.

We immediately started trying to figure out how we could serialize all of this data while making fewer database calls. What if, instead of making individual calls to MySQL to get the data for each individual vulnerability, we grouped all of the vulnerabilities together and made one call to MySQL to get all of their data at once? From this idea came the concept of bulk serialization. We started with a cache class. This cache class was responsible for taking a set of vulnerabilities and a client and then running all of the MySQL lookups for them at once.

We then took this cache class and passed it to our vulnerability serializer, which still held all of the logic needed to serialize each individual field, except that now, instead of talking to the database, it would simply read from the cache class. For example, in our application vulnerabilities have a related model called custom fields, which allows us to add any special data we want to a vulnerability. Prior to this change, when we serialized custom fields we would have to talk to the database. Now, we can simply talk to our cache class.
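
A rough sketch of the bulk-serialization idea, assuming a hypothetical CustomField model keyed by vulnerability_id (the class and method names are illustrative, not Kenna's actual code):

```ruby
class BulkVulnerabilityCache
  attr_reader :custom_fields

  def initialize(vulnerabilities, client)
    ids = vulnerabilities.map(&:id)
    # One MySQL query for the whole batch instead of one per vulnerability.
    @custom_fields = CustomField.where(vulnerability_id: ids)
                                .group_by(&:vulnerability_id)
  end
end

class VulnerabilitySerializer
  def initialize(vulnerability, cache)
    @vulnerability = vulnerability
    @cache = cache
  end

  def custom_fields
    # The field logic stays here, but the data comes from the cache, not MySQL.
    @cache.custom_fields.fetch(@vulnerability.id, [])
  end
end
```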

The payoff of this change was big. For starters, the time it took to serialize vulnerabilities dropped dramatically. Here is a console benchmark showing how long it takes to serialize 300 vulnerabilities individually: just over six seconds, and that's probably a pretty generous benchmark, considering it would take even longer when our database was under load. If instead we serialize the exact same 300 vulnerabilities in bulk, it takes a fraction of that time. Even bigger than the speed-up was the decrease in the number of database hits we had to make to serialize these vulnerabilities.

To serialize those 300 vulnerabilities individually, we have to make 2,100 calls to the database. To serialize the same 300 vulnerabilities in bulk, we now only have to make seven. As you can glean from the math here, it's seven calls per individual vulnerability, or seven calls for however many vulnerabilities you can group together at once. In our case, when we're serializing vulnerabilities we're doing it in batches of a thousand, so we took the number of database requests we were making for each batch from 7,000 down to seven.

This large drop in database requests is plainly apparent on this MySQL queries graph, which shows the number of requests we were making before and after we deployed the bulk serialization change. With this large drop in requests came a large drop in database load, which you can see on this RDS CPU utilization graph: prior to the change, the database was at 100%; afterwards, we were sitting pretty around 25%, and it's been like that ever since. The moral of the story here is: when you find yourself processing a large amount of data, try to find a way that you can use Ruby to help you process that data in bulk.

We did this for serialization, but it can be applied any time you find yourself processing data in a one-by-one manner. Take a step back and ask yourself: is there a way I could process this data together in bulk? Because one call for a thousand IDs is always going to be faster than a thousand individual database calls. Now, unfortunately, the serialization saga doesn't end here. Once we got MySQL all happy and sorted out, suddenly Redis became sad, and this, folks, is the life of a Site Reliability Engineer.

A lot of the time it feels like you put one fire out and you start one somewhere else; you speed one thing up and the load transfers to another. In this case, we transferred the load from MySQL to Redis, and here's why. When we index vulnerabilities into Elasticsearch, we not only have to make requests to MySQL to get all of their data, we also have to make calls to Redis in order to know where to put them in Elasticsearch. In Elasticsearch, vulnerabilities are organized by client, so to know where a vulnerability belongs, we have to make a GET request to Redis to get the index name for that vulnerability.

When preparing vulnerabilities for indexing, we gather up all of their serialized vulnerability hashes, and one of the last things we do before sending them to Elasticsearch is make that Redis request to get the index name for each vulnerability based on its client. The serialized vulnerability hashes are grouped by client, so those Redis GET requests were often returning the same information over and over again. Keep in mind, all of these Redis GET requests are super simple and very fast.

But as I stated before, it doesn't matter how fast your requests are: if you're making a ton of them, it's going to take you a long time. We were making so many of these simple GET requests that they were accounting for roughly 65% of our job runtime, which you can see in the table and represented by the brown in that graph. The solution to eliminating all of these requests, once again, was Ruby. In this case, we ended up using a Ruby hash to cache the Elasticsearch index name for each client.

When looping through those serialized vulnerability hashes, rather than hitting Redis for every single vulnerability, we could simply reference our client indexes hash. This meant we only had to hit Redis once per client instead of once per vulnerability. Look at how this paid off. Given these three example batches of vulnerabilities, no matter how many vulnerabilities are in each batch, we only ever have to hit Redis three times to get all the information we need to know where they belong.
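
A sketch of that hash cache, assuming the redis-rb client and an illustrative key format:

```ruby
# Lazily fills itself: Redis is hit only the first time a client appears.
client_indexes = Hash.new do |hash, client_id|
  hash[client_id] = redis.get("elasticsearch_index_client_#{client_id}")
end

serialized_vulnerabilities.each do |client_id, vuln_hashes|
  index_name = client_indexes[client_id]   # local memory after the first hit
  # ... add vuln_hashes to the bulk indexing payload for index_name ...
end
```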

As I mentioned before, these batches usually contain a thousand vulnerabilities apiece, so we roughly decreased the number of hits we were making to Redis by a thousand times, which in turn led to a 65% increase in our job speed. Even though Redis is fast, a local cache is faster. Getting a piece of information from a local cache is like driving from downtown Minneapolis to the Minneapolis-Saint Paul airport. Getting that same piece of information from Redis is like driving from downtown to the airport and then flying all the way to Chicago to get it.

Redis is so fast that it can be easy to forget you're actually making an external request when you're talking to it, and those external requests can add up and have an impact on the performance of your application. So keep in mind: Redis is fast, but a local cache, such as a Ruby hash, is always going to be faster. Those are two ways that we can use simple Ruby to replace our datastore hits. Next, I want to talk about how you can use your ActiveRecord framework to replace your datastore hits. This past year at Kenna we sharded our main MySQL database, and when we did, we chose to shard it by client, so each client's data lives on its own sharded database.

To help us do this, we chose to use the Octopus sharding gem and its handy-dandy using method, which, when passed a database name, has all the logic it needs to know how to talk to that database. Because our information is sharded per client, we created a sharding configuration hash which tells us which client belongs on which sharded database. Each time we make a MySQL request, we take the client ID and pass it to that configuration hash in order to get the name of the database we need to talk to.
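
In Octopus terms, the lookup described here looks something like this (the configuration hash and model names are illustrative):

```ruby
SHARDING_CONFIGURATION = { 1 => :shard_one, 2 => :shard_two }  # client_id => shard

database_name = SHARDING_CONFIGURATION[client_id]

# Octopus' `using` method points the enclosed queries at that shard.
Octopus.using(database_name) do
  Vulnerability.where(client_id: client_id).find_each do |vulnerability|
    # ... work against that client's database ...
  end
end
```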

Given that we have to access this sharding configuration hash every single time we make a MySQL request, our first thought was: why not just store it in Redis? Redis is fast, and the configuration hash we wanted to store was relatively small. It worked great at first, but eventually that configuration hash grew and grew as we added more and more clients, until it was 13 kilobytes. Now, 13 kilobytes might not seem like a lot of data, but if you're asking for 13 kilobytes of data millions of times, it can add up.

In addition to our growing configuration hash, we were also continually increasing the number of background workers we had running so that we could increase our data throughput, until eventually we had 285 workers chugging along at once. Remember, every single time one of these workers makes a MySQL request, it first has to go to Redis to get that 13-kilobyte configuration hash. It all quickly added up, until we were reading 7.8 megabytes per second from Redis, which we knew was not going to be sustainable as we continued to grow and scale.

When we started trying to figure out how we were going to solve this problem, one of the first things we decided to do was take a look at ActiveRecord's connection object. The connection object holds all the information your application needs to know how to talk to your database, so naturally we thought it might be a good place to store our sharding configuration. So we jumped into a console to check it out, and when we did, what we found was not an ActiveRecord connection object at all. Instead, it was an Octopus proxy object, which came as a complete surprise to us.

So we started digging into our gem source code, trying to figure out where the heck this Octopus proxy object had come from. And when we finally found that Octopus proxy object, much to our delight, it already had all these great helper methods that we could use to access our sharding configuration. Problem solved: rather than having to hit Redis every single time we made a MySQL request, all we had to do was talk to our local ActiveRecord connection object.
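
Roughly, the before and after might look like this; the helper method on the proxy is a hypothetical stand-in, since the talk doesn't name the exact Octopus methods used:

```ruby
# Before: every MySQL request first pulled the ~13 KB hash out of Redis.
configuration = JSON.parse(redis.get("sharding_configuration"))
database_name = configuration[client_id.to_s]

# After: ask the connection object already sitting in local memory.
proxy = ActiveRecord::Base.connection              # an Octopus proxy in this setup
database_name = proxy.shard_for_client(client_id)  # hypothetical helper method
```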

One of the big things we learned from this whole experience was how important it is to know your gems. It is crazy easy to toss a gem into your Gemfile, but when you do, make sure you have a general understanding of how it works. I'm not saying you need to go and read the source code for every one of your gems, but the next time you add one, maybe set it up manually the first time in a console so you can see what is happening and how it's being configured. If we'd had a better understanding of how our Octopus sharding gem was configured, we could have avoided this entire Redis headache.

However, regardless of where the solution came from, yet again, caching locally, in this case using our ActiveRecord framework as a cache, is always going to be faster and easier than making an external request. Those are three great strategies that you can use to help replace your datastore hits. Now I want to shift gears and talk about how you can use Ruby and Rails to avoid making datastore hits you don't need at all. I'm sure some of you are looking at this slide and thinking, I already know how to do that.

But let's hold up for a minute, because this might not be as obvious as you think. For example, how many of you have written code like this? I know you're out there, because I know I have. We assume that if there are no user IDs, then we're going to skip all of this user processing, so it's fine, right? Unfortunately, that assumption is false. It's not fine, and let me explain why. It turns out that if you execute this where clause with an empty array, you're actually going to hit the database when you do it.

Notice this WHERE 1=0 statement. This is what ActiveRecord uses to ensure no records are returned. It's a fast, one-millisecond query, but if you're executing this query millions of times, it can easily overwhelm your database and slow you down. So how do we update this chunk of code to make our Site Reliability Engineers love us? You have two options. The first is to not run that MySQL lookup unless you absolutely have to, and you can do that with an easy-peasy array check in plain Ruby. By doing this, you can save yourself from making a worthless datastore hit and ensure that your database is not going to be overwhelmed with useless calls.
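
Reconstructed from the description above (the model and variable names are illustrative):

```ruby
user_ids = []

# Looks harmless, but loading this relation still issues a query:
#   SELECT `users`.* FROM `users` WHERE 1=0
User.where(id: user_ids).each { |user| process(user) }

# Option 1: a plain Ruby guard skips the database hit entirely.
User.where(id: user_ids).each { |user| process(user) } if user_ids.any?
```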

In addition to not overwhelming your database, this is also going to speed up your code. If you run that code ten thousand times, it's going to take you over half a second to make that useless MySQL lookup 10,000 times. If instead you add that simple line of Ruby to avoid making the MySQL request and run a similar block of code 10,000 times, it takes less than a hundredth of a second. As you can see, there is a significant difference between hitting MySQL unnecessarily ten thousand times and running plain old Ruby ten thousand times, and that difference can add up and have an impact on the performance of your application.

A lot of people will look at that chunk of code and their first inclination will be to say, wait, I thought Ruby was slow; why is Ruby a hundred times faster in this case? Ruby is not slow. Hitting the database is what is slow. Keep an eye out for situations like these in your code, where it might be making a database request you don't expect.

I'm sure some of you Rails folks are probably thinking, I'm not actually writing code like this; I chain a bunch of scopes onto my where clause, so I have to pass that empty array, otherwise my scope chain breaks. Thankfully, even though ActiveRecord doesn't handle empty arrays well, it does give you an option for handling empty scopes, and that is the none scope. none is an ActiveRecord query method that returns a chainable relation with zero records, but more importantly, it does it without querying the database.

So, if we issue that where clause with the empty array, we're going to hit the database when we do it, and we're going to do it with all of our scopes attached. If instead we replace that where clause with the none scope, we never hit the database, and all of our scopes still chain together successfully. Be on the lookout for tools like these in your gems and frameworks that will allow you to work smarter with empty datasets. And even more importantly, never, ever assume your library, gem, or framework is not making a database request when asked to process an empty dataset.
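
A sketch with made-up scope names:

```ruby
# Empty array: still queries when loaded, with all the scopes attached.
User.where(id: []).active.recent.to_a
#   SELECT `users`.* FROM `users` WHERE 1=0 AND ... (scope conditions)

# `none`: a chainable zero-record relation, and no SQL is issued.
User.none.active.recent.to_a   # => []
```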

Because you know what they say about assuming. Ruby has so many easily accessible libraries and gems, but their ease of use can lull you into a sense of complacency. Once again, when you're working with a library, gem, or framework, make sure you have a general understanding of how it works under the hood. One of the easiest ways to gain a better understanding is through logging. Set your logging to debug for your framework, your gems, and every one of your related services. Then load some application pages, run some background workers, and jump into a console and run some commands.

Afterwards, look at the logs that are produced. Those logs are going to tell you a lot about how your code is interacting with your datastores, and some of it might not be interacting the way you would think. I cannot stress enough how valuable something as simple as reading logs can be when it comes to making optimizations in an application and finding useless datastore hits. Now, this concept of preventing useless datastore hits doesn't just apply to MySQL; it applies to any datastore you're working with.
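
For example, one way to crank up logging in a console session (the Elasticsearch line assumes the elasticsearch-ruby client; other clients have their own logging options):

```ruby
require "logger"

# Log every SQL statement ActiveRecord issues to STDOUT.
ActiveRecord::Base.logger = Logger.new($stdout)
ActiveRecord::Base.logger.level = Logger::DEBUG

# Many datastore clients accept a logger or log flag as well, e.g.:
es = Elasticsearch::Client.new(log: true)
```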

At Kenna we ended up using Ruby to prevent useless datastore hits to MySQL, Redis, and Elasticsearch, and here's how we did it. Every night at Kenna we build these beautiful, intricate reports for our clients from all of their asset and vulnerability data. These reports start with a reporting object, which holds all the logic needed to know what assets and vulnerabilities belong to a report. Every night, to build that beautiful reporting page, we have to make over 20 calls to Elasticsearch and multiple calls to Redis and MySQL.

My team and I did a lot of work to ensure all of these calls were very fast, but it was still taking us hours every night to build the reports, until eventually we had so many reports in our system that we couldn't finish them all overnight. Clients were literally getting up in the morning and their reports weren't ready, which was a problem. When my team and I started trying to figure out how we were going to solve this issue, the first thing we did was take a look at what data our existing reports contained. The first thing we decided to look at was how many reports were in our system: over 25,000.

That was a pretty healthy number for us, considering only a few months earlier we had only had ten thousand. The next thing we decided to look at was, okay, how big are these reports? Report size directly depends on the number of assets a report contains; the more assets in a report, the longer it's going to take us to build it. We thought maybe we could split these reports up by size somehow to speed up processing, so we looked at the average asset count per report: just over 1,600.

Now, if you remember back to the beginning of the presentation, I mentioned that our average client has 60,000 assets, so when we saw that 1,600 number, it seemed pretty low. The next thing we started looking at was, okay, how many of these reports have zero assets? Over 10,000. Over a third of our reports had zero assets, and if they have zero assets, that means they contain no data. And if they contain no data, then what is the point of making all of these Elasticsearch, MySQL, and Redis calls when we know they're going to return nothing?

Light bulb: don't hit the datastores if the report is empty. By adding a simple line of Ruby to skip the reports that had no data, we took our processing time from over ten hours down to three. That simple line of Ruby was able to prevent a bunch of worthless datastore hits, which in turn sped up our processing tremendously. This strategy of using Ruby to prevent useless datastore hits is something I like to refer to as using database guards. In practice it's super simple, but I think it's one of the easiest things to overlook when you're writing code.
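
The guard itself is about this simple (the method names are illustrative):

```ruby
reports.each do |report|
  # The database guard: skip every Elasticsearch, MySQL, and Redis call
  # for reports that contain no assets.
  next if report.asset_count.zero?

  build_report(report)
end
```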

Okay, we're almost there. This last story I have for you actually happened pretty recently. Remember those Resque workers I talked about at the beginning of the presentation? As I mentioned, they run with the help of Redis, and one of the main things we use Redis for is to throttle those Resque workers. In our sharded database setup, we only ever want a set number of workers working on a database at any given time, because what we found in the past is that too many workers working on a database will overwhelm it and slow it down. So we had pinned 45 workers to each database.

After making all of the improvements I just mentioned, our databases were pretty happy, so we decided, why not bump up the number of workers in order to increase our data throughput? So we increased the number of workers to seventy on each database, and of course we kept a close eye on MySQL. It looked like all our hard work had paid off: MySQL was still happy as a clam. My team and I were pretty darn proud of ourselves at this point, so we kind of celebrated for the rest of the day. But it didn't last long, because, as we learned earlier, often when you put one fire out, you start another one somewhere else.

MySQL was happy, but then overnight we got a Redis high-traffic alert, and when we looked at our Redis traffic graphs, we saw that at times we were reading over 50 megabytes per second from Redis. So that 7.8 megabytes per second from earlier isn't looking so bad now. This load was being caused by the hundreds of thousands of requests we were making while trying to throttle these workers, which you can see on this Redis requests graph. Basically, before any worker can pick up a job, it first has to talk to Redis to figure out how many workers are already working on that database.

If 70 workers are already working on the database, the worker will not pick up the job; if it's fewer than 70, it will. All of these calls to Redis were overwhelming it and ended up causing a lot of errors in our application, like this Redis connection error. Our application and Redis were literally dropping important requests because Redis was so overwhelmed by all of the throttling requests we were making to it.

Now, given what we had previously learned from all of our experiences, our first thought was: how do we use Ruby or Rails to solve this issue? Could we cache it? Could we maybe use ActiveRecord? Unfortunately, after pondering this problem for a few days, no one on the team came up with any great suggestions. So we did the easiest thing we could think of: we removed the throttling completely. And when we did, the result was dramatic. There was an immediate drop in the number of Redis requests being issued, which was a huge win.

But more importantly, those Redis network traffic spikes we had been seeing overnight were completely gone. Following the removal of all of those requests, all of the application errors we had been seeing resolved themselves. And of course, following the removal, we kept a close eye on MySQL, but it was still happy as a clam. So the moral of the story here is: sometimes you don't need to use Ruby and Rails to replace your datastore hits, and you don't need to use them to prevent your datastore hits. You might just need to straight-up remove the datastore hits you no longer need. This is especially important for those of you who have fast-growing and evolving applications.

Make sure you are periodically taking inventory of all the tools you're using to ensure that they're still needed. Anything you don't need, get rid of it, because it might save your datastores a whole lot of headache. As you all are building and scaling your applications, remember these five tips, and more importantly, remember that every datastore hit counts. It doesn't matter how fast it is; if you're making a million of them, it's going to suck for your datastores. Think of your datastore hits like dollar bills: a single dollar bill is cheap, but a million of them is real money. Don't throw your datastore hits around, no matter how fast they are.

Make sure every external request your application is making is absolutely necessary, and I guarantee your Site Reliability Engineers will love you for it. And with that, my job here is done. Thank you so much for your time and attention. Does anyone have any questions? We have about five minutes, and I'll also be available afterwards if anyone wants to chat. So the question was: were we able to downgrade our Redis or RDS instance after we decreased the number of hits we were making to MySQL?

The answer is we didn't; we left it the size it was, because I'm sure at some point we're going to hit another bottleneck and then we'll have to do this whole exercise again. But we have kept it the same size it was at the beginning. Thanks, great question. So the next question is: are there any cache-busting lessons that we learned from these examples? We have learned a bunch, and the one I think is the most important is to set a default cache expiration. Rails gives you the ability to do this in your configuration files. We did not do this from the beginning, and so at one point we had keys that had been lying around for five-plus years.

Set a default, so that everything will expire at some point. Then, finding that ideal expiration, how long a cache should live, takes some tweaking. Take your best guess and set it, and then observe what your load looks like and how clients are reacting to the data: do they think it's stale or not? Then tweak from there. What we have found is that every time we set a cache expiration, we always end up going back and tweaking it afterwards.
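
For reference, a default expiration can be set where the cache store is configured, something like this (the store choice and TTL are illustrative):

```ruby
# config/environments/production.rb
config.cache_store = :redis_cache_store, {
  url:        ENV["REDIS_URL"],
  expires_in: 12.hours   # default TTL so no key lives forever
}
```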

Okay, so the question is: did we have any issues going from storing that data in Redis to storing it in our local memory cache? The answer is no. In this case the cache is so small, it's literally just "this ID maps to this name," and the majority of the time that hash only had five to ten keys. So it was a very small hash, and obviously the payoff was super big. In that particular case, we have not run into any issues with it. Anyone else?

Alright. Thanks guys.
