5+ years combined experience in both sales and computing. Customer service oriented, with focus on building strong relationships between product developer, marketer and customer. Shows keen attention to detail and exhibits natural leadership qualities when working in a team environment.
About the talk
RailsConf 2019 - Keeping Throughput High on Green Saturday by Alex Reiff
This is a sponsored talk by Weedmaps.
Every April, we observe Earth Day and celebrate our planet’s beauty and resources, its oceans and trees. But only days earlier, another kind of tree is celebrated, and Weedmaps experiences its highest traffic of the year. Come see techniques we’ve used recently to lighten the latency on our most requested routes ahead of the elevated demand. Do you cache your API responses but want a lift in your hit ratio? Does that Elasticsearch best practice you know you’re ignoring cause nerves? We’ll pass our solutions to these problems — on the left-hand side.
How's everyone doing, RailsConf day 1? Welcome to Room 101E. My name is Alex Reiff, and I am 101 percent elated to be here. This is just my second time coming to RailsConf and my first time speaking at any conference, so it's super exciting. Go easy on me. I am from a little company called Weedmaps; you may have heard of us. We are the largest technology company solely focused on the cannabis industry. We have solutions for retailers, growers, wholesalers, and consumers. For pretty much every market vertical in the cannabis industry, we have some kind of software-as-a-service solution. Today I'm going to talk about a big day at Weedmaps and how we kept throughput high on Green Saturday.
So my team is focused on what we call our Discovery API. It's a read-only API that serves most of the content on our front end, so the website and our mobile apps are hitting the Discovery API to get their content. Its data source is Elasticsearch, which gives us super quick reads and a simple horizontal scaling mechanism when the pressure gets too high on the cluster. We source Elasticsearch from a queue pipeline that we've implemented called a bit, which is a layer on top of RabbitMQ. So messages are put on the queue by our core application. This is mostly our business owners working with the admin panel and the CMS. The core application hosts most of the business logic and maintains the source of truth in Postgres. Do you want to hear more about Postgres and how we do it? Come to Naught Duszynski and Craig Bucheck's talk tomorrow: stuff about Postgres, Active Record, all that good stuff. So it goes from core, onto the queue, to the Discovery API, into Elasticsearch, and then, in a surrounding cloud of smoke, it gets out there. If that sounds interesting to you at all, if you want to talk about it more, come see us at our booth in the exhibit hall. We're definitely hiring.
Now, the Discovery API has three main areas of focus. Firstly, it determines where I am. Then it determines the retailers, delivery services, and cannabis doctors near me. And then it shows me their products and the best deals I can find on them. So in short, when people are searching for green, they are using the services that my team is responsible for. And there's one particular day when all kinds of people are searching for green.
Now, if you read the conference program, you probably read "Green Saturday" and thought: is that a real thing? Did you make that up? Well, kind of. So this is the second-to-last weekend in April. A lot's going on. On Friday, we observe the first night of Passover and witness the miracle of the burning bush. On Easter Sunday, we remember Jesus Christ ascending to a higher plane. And on Monday, we celebrate our planet: the oceans, the green trees, the beauty all around us. But on that Saturday, two days before Earth Day, there is another holiday, what I'm calling Green Saturday. Now, I have an asterisk there because it's not always on Saturday, but it definitely is always on this date, April 20th. You may know it more colloquially as 4/20. This is the day that we celebrate the cannabis movement and all the progress we've made in access and criminal justice reform, and to celebrate, people consume a lot of cannabis. So to get that load, they need to find it, and to find it, they come to Weedmaps. So Weedmaps on April 20th experiences our highest traffic spike of the year, and it is consistently elevated year over year.
I've been at Weedmaps since 2016. You can tell by my limited-edition, early-run white "Get High" shirt; you can't get those anymore. So 2016 was my first year at Weedmaps. 4/20 came around and I was expecting a party. Unfortunately, that party was over GoToMeeting, because our database on the core application was deadlocked within a few hours of the start of business. Needless to say, that was a dark day in Weedmaps history. But from there, we really got serious about hardening our systems. We brought DevOps in house. We started a hiring spree, which is still ongoing today, and every year since then we've been making improvements.
So in 2017, we split our Postgres reads and writes. On our core application, our end-user traffic is very, very read-heavy, so that took the pressure off our master node. So no more deadlocks; good stuff. By 2017 we also had this new V2 API backed by Elasticsearch. It was used by a few clients for a few routes. It didn't do a whole lot, but it did take a little more pressure off the core application. So despite a little shakiness, we stayed up in 2017. Good year.
By 2018 California had gone recreational. We needed a lot more scale to handle that. We containerized our applications and put them in Docker. We use a system called Rancher to orchestrate our Docker containers; it allows us to scale up really quickly and change configuration if we need to. Also, by 2018 most of the traffic was going to this V2 API. It had grown up. It's now the Discovery API, but it was still on the original small Elasticsearch 2 cluster we set up way back in the day, and it was having some stability issues. So we went ahead and upgraded to the latest version at the time, Elasticsearch 6, and more importantly, we really got our index configuration tuned to the data that we have. In 2018, 4/20 was a Friday. It was one of the more boring Fridays I've had at Weedmaps. Good stuff.
Cut to 2019. I've been at Weedmaps three years, my white shirt's threadbare and kind of faded, and our traffic baseline is about double what it was a year ago. We had a few high-traffic events where things were a little bit shaky. So what do we do? What's the low-hanging infrastructure change we can make? What can we shoehorn DevOps into fixing for us? Obviously Kubernetes. No, it's not that talk. We got serious about handling our requests.
So here is a view of the Weedmaps home page, and you can see there are two main routes that power most of the content: /location and /brands/categories. The location route gets the bulk of the traffic and has three main phases. One, it determines the user's location based on provided coordinates from a device's geolocation. Then it maps that location to one or more sales regions, based on what services are legal and available in your area. And then it pulls the businesses advertising in those sales regions. Now, it's debatable whether a single API route should be serving all that information, but that's an ongoing push and pull with our front-end clients.
I'm sure the back-end engineers here can relate to some extent. But in general, this is data that doesn't change with extremely high frequency, so we think maybe we can cache this. But we still want a good experience for a business owner who updates their page. Maybe they have a new logo, or they're trading under a new name, or they buy some new advertising position. They want to see their stuff quick. So is there a compromise? That compromise is micro-caching.
Now, micro-caching is a strategy where you cache the entire web response for a brief period. So in this case, "micro" refers to your time to live, not necessarily your cache payload. If you're like us and have huge responses, your cache payload is actually kind of big. You've heard of microdosing? This is kind of the opposite. There you take a small little dose and it lasts for a long time; here we take a big thing and hold on to it for a small little bit of time.
Now, we love nginx at Weedmaps, and we have it enabled at many layers of our application stack. nginx makes caching really easy. As you can see at line 2, we have this proxy_cache_path directive. By default the cache is stored in the file system, so you set the path. Then you define a keys zone; in this case I'm calling it my_cache, and I'm saying it can hold 150 million cache keys. There's some other stuff you can configure; in this case I'm saying I only want to hold on to 200 megabytes before I start purging. Scroll down a bit to line 8 and you see our location block. This is where we proxy to Puma or Unicorn or whatever your app server is. A little bit below that is where we configure our proxy settings. So we say I want to use that my_cache I just defined. I'm saying I want to hold on to only the 200 response codes, and I'm holding on to them for half a minute, 30 seconds. And the proxy_cache_key is where I determine all the things that are unique about my request. In our case this is pretty complex. We have some routes that are based on GeoIP and different things the devices send us. But this is just a simplified example. We're just using the nginx variables for the host name, the request URI, and the authorization header, so if people are logged in, their unique content doesn't get cached for anyone else.
But we're in Docker and we're using the local file system. So we have many instances of this local cache, and they can't be shared, so our cache hit rate's not going to be great. Here comes OpenResty to the rescue. OpenResty is just nginx with some gourmet accoutrements. It comes with a Lua runtime and some libraries to make that small language a little easier to work with, and it comes with a series of nginx modules that allow nginx to talk to non-HTTP upstreams like Memcached. So just like we configure our Rails cache instance with a Memcached store to share our cached values across app instances, we can do the same thing here within nginx using OpenResty.
So we rolled out OpenResty with the micro-caching. How'd we do? Not so great. We got about a 5 to 6 percent cache hit ratio. We don't have that many kinds of requests, but we find that we get a lot of variation in them. Why is that? Well, we rely on mobile devices to send the user's coordinates to geocode their location, and mobile devices send very precise coordinates, going up to like 12 decimal places. It's really crazy, and most applications don't need this level of precision. Here's a chart that I stole from Wikipedia that I turned into a graph. It's charting the distance that corresponds to each degree of decimal precision, so distance is on the x-axis and degrees on the y. As you can see, below the fifth decimal place we're at sub-meter precision, probably not that important.
Or, a little more concretely, here is the location of our Weedmaps after-party at the Aria on North 1st Street, just a few blocks from the convention center. Hope to see you all there tomorrow. RSVP at railsconf.com/parties or come see us at the booth. Now, Google tells me that this coordinate is 44.984549 degrees north by 93.268504 degrees west. That's a lot of numbers. Maybe we don't need all of those. Let's drop it down to four, and we're in the same block, in the same building. So maybe we can go down further. Try two. How does that do? Two blocks away. Maybe that'll work for us, maybe it won't, depending on our use case. Maybe try one. Well, you can't see the one, because it's up the interstate in the north part of the city. Probably a little bit too far away. So the sweet spot is probably somewhere between five, four, three, maybe two, depending on your use case.
So where do I do this rounding? Well, we have a dry-struct in our application that models coordinates. Shout-out to the dry-rb team; we love you, because we use a lot of your stuff in the Discovery API. So we can easily stick a round method in here. We just pass in a precision and we get a new instance with the latitude and longitude rounded to what we say. Pretty straightforward. We can use that when we pass it into our queries. So we pull the coordinates out of the params on line 1, do that rounding on line 2, and then stick it into the query at line 3. But if we're doing it in Rails, we're already behind the micro-cache, so that won't really help our cache hit rate. Remember, we have the micro-cache up in nginx. So we think maybe we can do some kind of rewrite and change that request URI before the cache is accessed. And indeed, that's what we did. The sort of nginx rewrite we chose to do was done using a Lua plug-in on the API gateway that sits in front of all our API services.
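
As a rough illustration of the rounding method described above, here is a minimal sketch in plain Ruby. It uses a core Struct rather than dry-struct so it runs without gems; the names Coordinates, approx_meters, and METERS_PER_DEGREE are assumptions made for this example, and the figure of about 111,320 meters per degree is only a rough approximation that holds near the equator.

```ruby
# Hypothetical stand-in for the dry-struct coordinates model from the talk.
Coordinates = Struct.new(:latitude, :longitude) do
  # Returns a new instance with both values rounded to `precision` decimal places.
  def round(precision)
    self.class.new(latitude.round(precision), longitude.round(precision))
  end
end

# Rough size of one step at a given number of decimal places,
# assuming about 111,320 meters per degree (an approximation).
METERS_PER_DEGREE = 111_320.0

def approx_meters(precision)
  METERS_PER_DEGREE / (10**precision)
end

# The after-party coordinate quoted in the talk.
venue = Coordinates.new(44.984549, -93.268504)

[6, 4, 2, 1].each do |precision|
  rounded = venue.round(precision)
  puts format("%d places: %s, %s (about %.1f m per step)",
              precision, rounded.latitude, rounded.longitude, approx_meters(precision))
end
```

Returning a new instance instead of mutating the original keeps the precise coordinate available for anything that still needs it, while queries and cache keys can use the rounded copy.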
So that's not just Discovery; there's the core application I was talking about, and we have some Elixir services that power our B2B applications. If you're curious about those, come find us. So all those different services are proxied by this main gateway, and that gateway is powered by a service called Kong. Kong is just OpenResty, so it's just nginx, but its special twist is that it has dynamically configured routes, and for us, it has an awesome plug-in architecture that works just like Rails middleware. So there are hooks to modify the request and response at different stages of the handling.
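
The plug-in itself was written in Lua for Kong and is not shown in the talk. Since the comparison is to Rails middleware, the same idea can be sketched as a Rack-style middleware in plain Ruby. Everything here is an assumption for illustration: the class name CoordinateRounder, the parameter names latitude and longitude, the precision of 3, and the cache_key helper, which only mimics the simplified host, URI, and authorization key from the earlier nginx example.

```ruby
require "uri"

# Rack-style middleware (plain Ruby, no gems) that rounds coordinate query
# parameters before anything downstream, such as a cache lookup, sees them.
class CoordinateRounder
  COORDINATE_PARAMS = %w[latitude longitude].freeze # hypothetical parameter names

  def initialize(app, precision: 3)
    @app = app
    @precision = precision
  end

  def call(env)
    pairs = URI.decode_www_form(env.fetch("QUERY_STRING", ""))
    rounded = pairs.map do |key, value|
      COORDINATE_PARAMS.include?(key) ? [key, round_value(value)] : [key, value]
    end
    @app.call(env.merge("QUERY_STRING" => URI.encode_www_form(rounded)))
  end

  private

  def round_value(value)
    Float(value).round(@precision).to_s
  rescue ArgumentError
    value # leave anything non-numeric untouched
  end
end

# Simplified cache key: host, path plus query string, and Authorization header.
def cache_key(env)
  [env["HTTP_HOST"], "#{env['PATH_INFO']}?#{env['QUERY_STRING']}", env["HTTP_AUTHORIZATION"]].join("|")
end

# The downstream "app" just reports the cache key it would use.
app = CoordinateRounder.new(->(env) { [200, {}, [cache_key(env)]] })

def request(query)
  { "HTTP_HOST" => "api.example.test", "PATH_INFO" => "/location", "QUERY_STRING" => query }
end

first_key  = app.call(request("latitude=44.984549&longitude=-93.268504"))[2].first
second_key = app.call(request("latitude=44.984702&longitude=-93.268611"))[2].first
puts first_key
puts second_key
```

Two nearby devices now produce the same key, which is what lets a shared cache answer the second request. The trade-off is the one discussed above: the coarser the precision, the higher the hit ratio and the less exact the location.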
But I'll say Lua is not so fun. So if someone maybe wants to write some Crystal bindings for nginx, that'd be dope. So, back to our graph: before, the cache hit rate was about 5 to 6 percent. We roll out the plug-in and enable it on our Discovery API routes, and how did we do? Much better: 9 to 10 percent. I'll take it. So cool, nginx is handling a few thousand requests now and Rails can take a break. Thanks, nginx.
But we can't really take a break, because we've still got thousands more location requests to process. So we started to look into our route controller. We use New Relic at Weedmaps as our application performance monitor, and it has really detailed transaction traces. It was telling us that our region query was consistently the slowest-running operation. We turned on Elasticsearch slow query logs, and they confirmed this. So out of these three main parts of the route (the geolocation, determining the user's region, and then pulling up the businesses advertising), we chose to focus on the regions. Fortunately for us, our sales team is not too often changing these regions' boundaries once they've been established. So given a coordinate, it's not very likely that the region's going to change minute to minute, or even really day to day. So let's cache that, but a little longer this time; we don't need that micro-cache. In this case, we're doing 10 minutes. So here's that rounding snippet again: pulling our coordinates out, passing them to the struct, doing the rounding. And now we're just using a standard Rails.cache.fetch: passing our cache key, defining our time to live, saying 10 minutes. Really, it probably could have been 10 hours; it doesn't really matter. If the cache key hasn't been set or the old value is expired, we'll get that block called at line 6, doing that query again.
This is what's termed a read-through cache. Pretty simple, pretty standard; it's what most people think of when they think of caching. But there is an alternative. Ask yourself: will the data I get on my cache refresh likely be different from what I have stored? Will my users even notice if it is? If you can say no to either of these questions, you might want to replace the simple cache with what's called a write-behind cache.
So what's the difference with the write-behind? If the cache key is expired, rather than going to refresh it as part of the fetch and returning the new data, the expired cache value gets returned, but at the same time a background worker is enqueued, and that background worker's job is to go fetch the new data and store it in the cache. So the next person that comes and gets it will have the new data. That's something we had to implement ourselves. So check out the set method on line 7. Normally when you call your cache, you'll pass the time to live to the set as part of the parameters to the call. But in this case, we're storing our time to live as an expires_at as part of the cache payload. So you can see that on lines 8 and 9 we're setting up our cache payload. When I go to fetch it, in the fetch method, we first check the expiration date at line 16. If we're before the expiration date, everything's still good; go ahead and return our payload at line 17. But if we're past the use-by date, we first have to call the refresh before returning it. And that refresh just fires off a background worker that calls your API, or whatever the resource is, and then sticks it back in that cache. So you can see at line 27 we just have a simple Active Job, and that's responsible for calling the set part of the cache. So the ultimate goal here is to shift any back-end or upstream latency away from the end user, as part of their synchronous web request, to the background. It's much better to have a slow Sidekiq worker than a slow API route.
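
The slide code is not part of this transcript, so the following is only a minimal sketch of the write-behind behavior as described: the expiry lives inside the payload, an expired entry is still returned, and a refresh is queued for later. The class name WriteBehindCache, the in-memory Hash standing in for Rails.cache, the array of lambdas standing in for Active Job, and the injectable clock are all assumptions made so the example runs on its own.

```ruby
# Minimal write-behind cache sketch. The store stands in for Rails.cache and
# the jobs array for a background queue such as Active Job (both assumptions).
class WriteBehindCache
  def initialize(store: {}, jobs: [], clock: -> { Time.now })
    @store = store
    @jobs = jobs
    @clock = clock
  end

  # The expiry is stored inside the payload instead of being handed to the
  # store as a TTL, so the store never evicts the stale value on its own.
  def set(key, value, ttl:)
    @store[key] = { value: value, expires_at: @clock.call + ttl }
  end

  def fetch(key, ttl:, &loader)
    entry = @store[key]
    # Cold cache: nothing stale to serve, so load inline this one time.
    return set(key, loader.call, ttl: ttl)[:value] if entry.nil?

    # Past the use-by date: queue a refresh, but still answer with the old value.
    refresh(key, ttl, loader) if @clock.call >= entry[:expires_at]
    entry[:value]
  end

  # Runs queued refreshes; in a real app a worker process would do this.
  def run_jobs
    @jobs.shift.call until @jobs.empty?
  end

  private

  def refresh(key, ttl, loader)
    @jobs << -> { set(key, loader.call, ttl: ttl) }
  end
end

now = Time.at(0)
cache = WriteBehindCache.new(clock: -> { now })
lookups = 0
loader = -> { lookups += 1; "region-v#{lookups}" }

cold = cache.fetch("region:44.985,-93.269", ttl: 600, &loader)
now += 601 # ten minutes pass and the entry expires
stale = cache.fetch("region:44.985,-93.269", ttl: 600, &loader)
cache.run_jobs # the background worker catches up
fresh = cache.fetch("region:44.985,-93.269", ttl: 600, &loader)
puts [cold, stale, fresh].inspect
```

This sketch queues one refresh per expired read, so a busy key could enqueue duplicate jobs; a production version would likely deduplicate them. The point it illustrates is that the slow upstream call never runs inside the request that found the entry expired.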
So in this use case, our upstream service is Elasticsearch, where we have all our regions stored. So what are some ways that latency can increase with Elasticsearch? One we noticed: bursts of writes under heavy read load increase latency. So let's say on 4/20, about noon, retailers have been open since 8 a.m. and they're running out of stock. So they hit that sync button on their POS to take all the out-of-stock items out of a huge menu, let's say 4,000 items. Get a couple of those and we'll see a little bit of a spike in latency. But there's a second thing.
Continually running expensive read queries. So Elasticsearch has a feature where you can join documents into a kind of parent-child relationship. We used this feature to store our region document metadata separately from the geometry. Suffice it to say, when we implemented this, it was a little more convenient to do these delta updates without having to pull the geometry around every time. And querying was still pretty convenient too: you could express a query that was like, find a region that has a child document that has a geometry that matches my coordinate.
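
For readers who have not seen the Elasticsearch query DSL, a query of the kind just described might be built like this in Ruby. This is only a sketch: has_child and geo_shape are real query types, but the child type name region_geometry, the field name geometry, and the function name are hypothetical, since the talk does not show the actual mapping.

```ruby
require "json"

# Builds a "find a region that has a child whose geometry contains this point"
# query body. Type and field names are hypothetical placeholders.
def region_for_point_query(latitude, longitude)
  {
    query: {
      has_child: {
        type: "region_geometry",
        query: {
          geo_shape: {
            geometry: {
              # GeoJSON order is longitude first, then latitude.
              shape: { type: "point", coordinates: [longitude, latitude] },
              relation: "intersects"
            }
          }
        }
      }
    }
  }
end

body = region_for_point_query(44.9845, -93.2685)
puts JSON.pretty_generate(body)
```

Even in this small form the nesting is five levels deep before the coordinate appears, and the join between parent and child has to be resolved at query time.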
It's not the nicest code to write. If anyone's worked with the Elasticsearch DSL, you've got these super deeply nested structures. So the Ruby wasn't great, but it worked. Or did it? As the kids say, yikes: "If you care about query performance, you should not use this query." That's coming straight out of the Elasticsearch query documentation. And we had this on our main route. As is so often the case in life, what seemed cheap and convenient at the time ended up being expensive and harmful down the way. Occupational hazard.
So what do we do? We rethought our indexing strategy. We collapsed the two documents into a single structure. We built some tooling taking advantage of the Elasticsearch Painless scripting interface, and we maintained those delta updates. How'd that do for performance? Our peak before the release was about 45 milliseconds of Elasticsearch time in our location route. Our new peak was about 30 after the deployment. So a 33 percent improvement, just what we were hoping for. Awesome.
So is there any other place we were using this where we could maybe make the same optimization? Naturally, the answer is yes. So remember that other route from the home page, with the brands categories, where you see the flower or concentrates? Previously, this home page would show the brand logos instead of the product cards, and that legacy feature made use of the join documents. Fortunately, the new feature did not, so we learned our lesson, at least that time. But our clients were requesting both the new and the old data.
So even though we fixed it going forward, we still had that nagging old query. Well, we worked with our product team and came up with a strategy to map the new to the old, so we could use the same query to generate both the new and the old response. And we eliminated that join. With such a great result with the regions, replacing this would probably have a great result too, right? What the... what happened there? Now, you might be thinking that big brown spike is Elasticsearch and we did something wrong there, but Elasticsearch is actually the purple, and it got better. No, the brown is the time spent converting Elasticsearch's query response to our own API response format. So if anyone's worked with that, you've got to go, like, hits, hits, zero, _source, pull out your field. Not so great. And then we have to map it to our own data structures, with JSON API schemas. So essentially it's Ruby using memory and taking CPU time to build hashes.
And it turns out, just like it takes time to make hash, moving data in and out of Ruby hashes is pretty time-consuming too. And it's expensive in C. Now, it's beyond the scope of this talk, but take my word for it: there are a lot of data structures involved with this hash operation. So this is the code behind just a simple hash lookup in C. I'm no C developer, but I see nested conditionals, I see a goto statement, there are some loops, and I know pretty universally nested conditionals often cause some pain.
So what caused the big spike in Ruby wasting time with dumb hash? No one likes hash. No one smokes it anymore. As it turns out, when we started using the new non-join query result to build the legacy response, we ended up parsing that Elasticsearch response two times over, so twice the CPU time. All right, darn. How do we fix that? Simple memoization. But what really was the problem here? The problem was that we forgot to test the performance of our changes. We made some assumptions about how changing and manipulating our data would perform, but we never confirmed those assumptions. Once we did, and got the fix in, we were able to confirm the latency dropped back down to what we'd expect. So: two routes, two sets of improvements, ready for Green Saturday, 4/20. We're pumped. But the only performance test that really counts is how production performs on that big day. Anyone curious how we did? All right. At our peak we were serving 100,000 requests a minute through our Kong API gateway.
53 percent of those were going to the Discovery API. 20 percent of those were location requests. So all that fine micro-tuning really, really paid off. Our cache hit rate was a respectable 9 percent. So almost 5,000 requests were coming out of the cache. Thanks, nginx, doing a good job. And this traffic is not quite three times the normal throughput we see on Saturdays, and with this, our average latency stayed under a hundred milliseconds. In fact, the average was about 80 at our peak time, so, doing great. And indeed, we had 100 percent uptime. So quite the success. As you might imagine, that party I was expecting way back in 2016 was in the air on this Green Saturday, this 4/20.
So what did we do to achieve this success? Well, for one, we cached a lot of stuff at different layers of the service. So you should consider whether your users always require the most up-to-date information coming straight from your source of truth, or whatever data store you use. If you run a public website, there are probably more than a few cases where you can cache your web responses, at least for a brief period.
Tune your user input to make it work for your application. The various sensors and signals that we carry with us all day long in our smartphones can share a wide variety of very, very precise data with our applications and services. Depending on your use case, you might want to manipulate that data before plugging it into your business logic or your cache key generator. The GPS sensor does not know that a particular route needs only to know the user's location with ZIP-code-level granularity.
We limited the external requests we made that might affect our users' response time. One of the worst feelings is looking at your application performance monitor and seeing a big spike in your response time, and then you trace that back, go to the external services tab, and you see there's a big corresponding spike in that external service's response time. So where you can, try to move that external request to a background worker and persist the result for your API service to fetch later. If it doesn't make sense to do this kind of write-behind caching, you can have a cron job that refreshes it on some kind of regular interval. So think about what kind of data really matters to your users and how up to date it needs to be.
Please, please reconfirm your schema is right for your queries. This applies to any database where you are defining your indexes: Elasticsearch, Postgres, MySQL, whatever you're using. Often, as your app evolves, your initial index setup may no longer be ideal. Maybe there's some new screen you added that's querying or filtering on a field that wasn't originally intended for filtering, or maybe you're doing some kind of join in an unexpected way.
These things come up. In our case, our index setup just proved to be against best practice. So I recommend going back and reviewing the documentation for the latest versions of your database. Amongst the new feature details, I've found existing functionality gets clarifying notes, and gotchas that are discovered out in the wild are called out. You might feel called out for some things you're doing; don't get discouraged. And finally, do that benchmarking. When you're in the midst of making these improvements to your application, it's easy to move a little too fast.
Once you have a fix, it's human nature to want to rush to get it into production and get the best result for users. But spend the time to set up that performance test. There are many tools out there you can use to simulate load. There's ApacheBench and JMeter, which we use at Weedmaps. wrk and a few others are out there too. So check out your master branch and run your tests there to get a baseline, then compare the results when you check out your feature branch. When you validate your assumptions, you might just catch something unexpected before it's messing with your Apdex.
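
The load tools above exercise a whole deployment, but the same habit of measuring before and after works at the level of a single method. As a small sketch, here is the double-parsing problem and the memoization fix from earlier in the talk, timed with Ruby's standard Benchmark module. The SearchResponse class, its method names, the parse_count counter, and the generated payload are all invented for this example.

```ruby
require "benchmark"
require "json"

# Hypothetical wrapper around a raw Elasticsearch-style response.
# `parse_count` exists only to make the duplicated work visible.
class SearchResponse
  attr_reader :parse_count

  def initialize(raw_json)
    @raw_json = raw_json
    @parse_count = 0
  end

  # Unmemoized: every caller pays for a full parse.
  def hits_slow
    parse
  end

  # Memoized: the first caller parses, later callers reuse the result.
  def hits
    @hits ||= parse
  end

  private

  def parse
    @parse_count += 1
    JSON.parse(@raw_json).dig("hits", "hits").map { |hit| hit["_source"] }
  end
end

documents = Array.new(5_000) { |i| { "_source" => { "id" => i, "name" => "brand-#{i}" } } }
raw = JSON.generate({ "hits" => { "hits" => documents } })

slow = SearchResponse.new(raw)
fast = SearchResponse.new(raw)

# Both the new and the legacy response builders would ask for the hits.
slow_time = Benchmark.realtime { 2.times { slow.hits_slow } }
fast_time = Benchmark.realtime { 2.times { fast.hits } }

puts format("parsed twice: %.4fs, memoized: %.4fs", slow_time, fast_time)
puts "parse calls: #{slow.parse_count} vs #{fast.parse_count}"
```

Timings vary from machine to machine, so the counter is the dependable signal here; the timing is what would be compared between the baseline branch and the feature branch.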
In general, what we found is that the key to high throughput was giving our Rails app and its database a break. Take advantage of the other services you already have set up. Whatever proxy layer you have, or your ElastiCache stores, are probably underutilized. Let them take some of the edge off. And at Weedmaps, we know a little something about taking the edge off. Thank you.