About the talk
Let's be blunt: Most web apps aren’t so computation-heavy and won't hit scaling issues.
What if yours is the exception? Can Rails handle it?
Cue Exhibit A: Cloudinary, which serves billions of image and video requests daily, including on-the-fly edits, QUICKLY, running on Rails since Day 1. Case closed?
Not so fast. Beyond the app itself, we needed creative solutions to ensure that, as traffic rises and falls at the speed of the internet, we handle the load gracefully, and no customer overwhelms the system.
The real question isn't whether Rails is up to the challenge, but rather: Are you?
Backend Developer at Cloudinary, building digital asset management solutions at scale. Learning to optimize peopleware (myself and others) through autonomy, empowerment, and learning.
Hey everybody. My name's Ariel Caplan, and I'd like to welcome you to my talk, "The Trail to Scale Without Fail: Rails?" Before we hop into the content of the talk, I want to take a moment to step back and look at the image on your screen right now. I want you to think about the reason I chose this image, which doesn't really seem to be related to scale at all, for this talk. To me, what I see here is one user, on one device, having one experience. When we talk about scale, it's really easy to get lost in big numbers
and all kinds of fun technical details. But ultimately, what we're really trying to do is create this kind of individual experience many, many, many times over. I think that when we think about things that way, when we're not focusing just on, you know, the average response time and things like that, but we're really thinking through the ultimate user experience, we get to a much better result. We come to much better solutions. Now it's time to go to the other side of the spectrum, dive in, and get technical. So
what do you think is the biggest Rails app in the world? I started thinking about this probably a few years back, when I saw a talk at RailsConf called "The Recipe for the World's Largest Rails Monolith," by Akira Matsuda. During that talk, he described Cookpad and the traffic that they got. He talked about a number of other metrics to describe the gargantuan scale of their Rails monolith. But the one I want to focus on (kind of arbitrarily chosen, so sorry about that) is requests per second. In this
case, he said that they have 15,000 requests per second. Pretty impressive scale, at least I thought so at the time. At the company I was working for back then, I think we had a few hundred requests per minute, so definitely a very different scale. Then at RailsConf, Simon Eskildsen talked about Shopify, where he described that they have 20,000 to 40,000 steady requests per second, peaking at 80,000. I'll take the 20,000 to 40,000, though. And actually, just a few months
ago, a VP of engineering at GitHub said that the GitHub API deals with over a billion API calls daily. I don't know if this is really a subset of their overall traffic; I'm going to assume that it's within the general order of magnitude, because that's about the best information I can get. So we have a few different applications really working at scale that we can use as examples from the community. If we want to compare them, we'll normalize to requests per day. Cookpad is at about 1.3 billion, with a B,
requests. Shopify is at 2.6 billion, and that was in 2016. I don't know how Cookpad has grown since then. In the meantime, Shopify has nearly doubled the number of stores that they are hosting, so I'd guess they're probably around 5 billion by now. GitHub, again just a few months ago, was at a billion. At Cloudinary, where I work, looking at our entire system, we take about fifteen billion requests per day. And I want to point out that these are not just the regular requests you might normally think about.
They are actually, for the most part, media requests. So it's like you take the other three, double them, and that's basically the scale of Cloudinary. That's just to give you a sense of where we stand in that world. I did play with the numbers a bit, so I apologize. In a typical app, the client sends a request to the server. The server probably talks to a database, gets some information, and finally sends some kind of response back to the client computer, whether in HTML or JSON format. When we
deal with Cloudinary, it's a little bit different. Now, I'm going to keep this a very simplified view of the app, even though both your apps and our app are much more complicated in their architecture. Cloudinary generally is going to be serving not HTML or JSON but images and videos, and in fact not necessarily just images and videos that we have stored. We are probably going to be doing some kind of transformation, making some derivation of the image, which is actually a very computationally intensive process, and that's what we're doing on basically each of these requests.
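The requests-per-day figures quoted a little earlier follow from the per-second rates by simple multiplication. The sketch below redoes that arithmetic. Using 30,000 as the midpoint of Shopify's 20,000 to 40,000 range is an assumption made here because it matches the quoted 2.6 billion, and the 5 billion and 1 billion inputs are the speaker's own rough estimates.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60 # 86,400

# Converts a steady requests-per-second rate into requests per day.
def requests_per_day(requests_per_second)
  requests_per_second * SECONDS_PER_DAY
end

COOKPAD_PER_DAY = requests_per_day(15_000) # 1,296,000,000, about 1.3 billion
SHOPIFY_PER_DAY = requests_per_day(30_000) # 2,592,000,000, about 2.6 billion (assumed midpoint)

# "Take the other three and double them", in millions of requests per day,
# using the talk's estimates: Cookpad 1,300, Shopify today 5,000, GitHub 1,000.
OTHER_THREE_DOUBLED = (1_300 + 5_000 + 1_000) * 2 # 14,600 million, close to 15 billion

puts COOKPAD_PER_DAY
puts SHOPIFY_PER_DAY
puts OTHER_THREE_DOUBLED
```

Working in whole millions for the last step avoids floating-point noise in the comparison.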
So let's dive a little bit deeper into what Cloudinary actually does, because I do think it's really important, especially if you're not familiar with Cloudinary, to understand the nature of the service, the kind of use cases that we have, therefore the kind of traffic that we have, and therefore the kind of load that we have to deal with. That will carry us through the rest of the talk. I'm going to focus on one image and how it journeys through the process. The first thing, of course, is upload: you can upload an image to Cloudinary. We store it in one of our Amazon S3 buckets, though it could be Google Cloud Storage in certain situations. Then we do transformations, and we deliver via CDN. Probably most of you can imagine what uploading and storing are. The latter two items on the list are probably a bit more confusing, so I'm going to go through them right now, just so you can get a sense of what's actually happening there, since maybe you don't really know what transformations are, or maybe you've never heard of a CDN. What even is that? So let's go through that.
Let's say you have an image. This is an image uploaded by a user. It's not really cropped perfectly; it's got a lot of open space there. And you need to fit it into one of a number of different layouts on your site. So you probably have to have a few different crops: there's the portrait crop and the landscape crop, among others. And you want to crop intelligently. You don't want to just randomly take a slice, or scale it in kind of a gross way. You really want to think about how you're going to crop it to the most interesting parts of the image. Maybe, if these are going to be avatars, you want to really focus right in on the face. And the truth is, you might want to have different sizes and shapes of the image. Maybe you want to have a round form. You want a nice big one for the author's profile page, and then when someone posts a comment they have a little tiny avatar. So there are many, many different versions of the same image.
And the truth is, you might not want to stop there. You have this (again, lovely, arbitrarily chosen) image, now cropped nicely for your layout, and you know what, you want to make it pop a little bit, maybe enhance the colors a little. Watch closely as we do that and make the colors pop a little bit more. And then we said, you know what, privacy is really important these days; people care a lot about it. We're going to pixelate the user's face for their privacy. And on the top right, of course, we want to add our company watermark at about 50% opacity. All these things you can do in Cloudinary by just adding on little bits to the URL. We can see here, on the top line, we have this improve-indoor effect all the way on the right, which is what was doing the color change. Then we have the width, the height, a crop mode, and gravity focused on the face. q_auto is another thing that we do. By the way, again, I'm not trying to sell you Cloudinary (it happens to be great), but really my goal here is that you understand how heavy each request is. So q_auto basically takes the image that's ultimately going to be generated and tries to figure out the optimal compression. Are we going to compress it a lot or just a little bit, based on what's going to reduce the file size but keep the image quality very high for what's ultimately provided to the user, for their visual experience? On the next line there's pixelate_faces, which identifies where the face is and then pixelates it. And then on the last line, we have the logo put on the top right, with the opacity set. You can see at the end of that long string, of course, the various crops and positioning and all that. You literally can go to this URL, and if you change any of the parameters, it will generate a different image, starting from that initial image and then going down the line.
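To make "adding little bits to the URL" concrete, here is a minimal Ruby sketch that assembles a delivery URL from a chain of transformation steps like the ones just described. It is not the URL from the slide. The cloud name demo, the public IDs sample_user_photo.jpg and company_logo, the pixel sizes, and the helper name transformation_url are all hypothetical, and the parameter spellings (e_improve:indoor, c_fill, g_face, q_auto, e_pixelate_faces, l_, o_, g_north_east) reflect Cloudinary's URL syntax as commonly documented, so they should be checked against the current reference before use.

```ruby
# Hypothetical base host and path layout for image delivery URLs.
CLOUDINARY_BASE = "https://res.cloudinary.com"

# Transformation steps, applied left to right. Each step is one URL path
# component; the parameters inside a step are joined with commas.
STEPS = [
  { e: "improve:indoor" },                        # colour enhancement
  { w: 300, h: 300, c: "fill", g: "face" },       # resize and crop around the face
  { q: "auto" },                                  # automatic compression level
  { e: "pixelate_faces" },                        # detect faces and pixelate them
  { l: "company_logo", g: "north_east", o: 50 },  # logo overlay, top right, 50% opacity
].freeze

# Joins "key_value" pairs into components and components into a URL.
def transformation_url(cloud_name, public_id, steps)
  chain = steps.map do |step|
    step.map { |key, value| "#{key}_#{value}" }.join(",")
  end
  [CLOUDINARY_BASE, cloud_name, "image", "upload", *chain, public_id].join("/")
end

puts transformation_url("demo", "sample_user_photo.jpg", STEPS)
```

Changing any value in STEPS yields a different URL, which is why every distinct combination of parameters is a distinct derived asset that has to be generated at least once.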
So every request to Cloudinary, just to give you a sense of these fifteen billion requests, can be doing a ton of work. Just think about the original image and how much we had to go through to get here. And there's more, because your users are probably accessing your site from many different devices and different browsers, and it turns out that different devices and browsers support different image formats. WebP is a really good example. It was supported in Chrome for a number of years; Firefox only added support in 2019. Then there's JPEG XR, which is mostly a Microsoft format, so you might want to serve it for Edge or Internet Explorer but maybe not for everyone. Coming down the pipe a little bit, there are competing formats such as AVIF, and another one that's a little bit closer to home, I guess, since one of the people at Cloudinary is one of the researchers working on it. So there's a lot going on in the format space. You might want to serve different formats, and you have to decide how you're going to do that. Well, Cloudinary will generate the right image format based on
the parameters of the request. Finally, let's talk about CDNs. Here's the idea. You have your servers; let's just arbitrarily choose Northern Virginia, USA, us-east-1. It's a very popular AWS region. But your users might be anywhere around the world, right? Media requests can be very heavy, with a lot of bandwidth, and you want to serve them from a server that's closer to the user. But your app servers are only in one place. So a CDN will provide you with a network of computers all around the world. By the way, there are a few CDN providers out there, and they're growing all
the time. But it's a network of computers around the world. CDN is short for content delivery network, so it's the network that delivers your content. It will cache the media, the image or the video, very close to where the user is, and every time a user from that area requests it, they'll get directed to whichever edge node (that's what a computer on the network is called) is closest to them, and they'll be able to get the media from there. That saves a lot of time in terms of how long it takes for the user to download the content.
So now you're a little more familiar with how things work and how an image journeys through Cloudinary, and you can get a sense of what kind of traffic we're dealing with. Ultimately, we are serving things through the CDN; that's going to be really important in a minute. A small percentage of requests do come from other things: our media library, which is a GUI for media cataloging and editing, and various operations available through our API, some of which can be
really, really heavy. Bulk operations could be things like "delete 40,000 assets." So bulk operations can be very, very expensive, but I'm still not going to focus on them, because all in all that's 0.3% of the total requests. The more interesting question is how we serve the 99.7% of requests that are for the delivery pipeline, delivery meaning transforming and serving media. We also do video; I will not focus on video in this talk. It brings in a whole different set of
considerations and really difficult problems to solve. So I'm just going to focus on images. Before we dive into how Cloudinary solves this problem, I feel weird because I haven't introduced myself yet. So, hi everyone, again, I'm Ariel. I've been working since graduating from the Flatiron School in 2015. I've been working at Cloudinary since 2018. I am @amcaplan everywhere on the internet that matters, which is basically Twitter and GitHub. And by the way, if you are not in the
conference and you want to contact me, if you're watching this video later and you want to get in touch, please reach out on Twitter. My DMs are open, and I'd be happy to answer questions that way. Last thing: I really love conferences. I've been to a lot of RailsConfs and RubyConfs. I really love speaking, and I really love interacting with people and hearing your different stories and perspectives. I say that, and emphasize it, just to say that there's a lot of really cool stuff coming up right now, and I did not personally build just about any of it. I'm really fortunate to
have had the opportunity to talk to the people who did make a lot of this stuff. So I apologize if there are questions today that I can't answer; I will be able to get the right answer from the right people. But my main interest here really is saying that it's a really cool system, and the Rails world really needs to hear about these success stories of Rails at scale. I just wanted to share that with all of you, and to hear about your stories. Let's dive in now to how Cloudinary scales. And actually, one thing I'm not going to talk about, by the
way, is the scaling piece of it: autoscaling. In other words, all of our servers are in the cloud, and we can scale different groups up and down. There's a whole art and science to scaling and how to do it right. I'm not going to talk about that today, but just know that every service I'm going to present to you today can be independently scaled, as you would expect, and we do. So the topics we're going to cover are: number one, dividing our services into various layers; number
two, sharding; number three, the element of location, as in geographical location; number four, deduplicating work, making sure not to do the same thing twice; then throttling, or deciding what to do when we can't scale just yet; and then human factors, because those are no less important than anything else and they do play a big role in our scaling strategy. So let's dive right into the first topic, which is layers. Anytime I hear the word layers, I just think of this quote from Shrek, and I'm sorry, I can't do the accent. But, you know, Donkey
is trying to say that ogres are like, you know, layers, maybe like a layer cake, and Shrek gets very upset and says, "Layers. Onions have layers. Ogres have layers. Onions have layers. You get it? We both have layers." And Cloudinary, like ogres and like onions, also has layers. So let's go through the basics of an actual request life cycle in Cloudinary's system. This is a more in-depth diagram compared with what we saw before. It is still not the full picture; there are things that we'll add later and things we won't add. But this is really a more
complete story. So we start with the user's computer. They're going to make a request, and actually it's going to hit the CDN first. If the CDN has the asset already cached, it's just going to serve it and not go any further. But if not, the request is going to go further down the pipe and hit Cloudinary's servers, and now we really start to do the work. The first thing it hits is a system we call S3Get. I put a shield there because it kind of serves to shield our main app from the bulk of the traffic. S3Get is a very simple service. It's written as
"S3 get" here because that's where the name comes from. It really just basically says: okay, based on the request URL and whatever other information we have, let's talk to the correct cloud storage. It can be S3; it can be Google Cloud Storage. It was named at a time when S3 was the only kind of storage, so there's your explanation of the name. It checks the cloud storage to see if we have the asset available, and if we do, then we can just send it back down the pipe and never bother anything deeper. In case we haven't created this asset yet, we will then
talk to what we call the IO layer. We call it that just because it's IO-bound, and that's kind of the main app that interfaces with just about everything in the outside world. The IO app has access to our databases and talks to cloud storage, and if there needs to be some kind of transformation, it will then talk to the CPU layer. The CPU layer is the more computationally intensive layer. We call it CPU because it is mainly CPU-bound. Okay, and of course when the CPU layer has the resulting
image, it will send it back up the chain, and finally to the user's computer. So here's now the explanation of how the layers really help us. There are about fifteen billion requests a day from users' computers to the CDN layer. In 14 out of 15 cases, the CDN already has the asset stored, and so it never needs to talk to Cloudinary's servers. So in about 1 billion of those cases, it will hit S3Get. S3Get will then pass on only about 15% of those requests; 85% of requests it will deal with itself. So about 150 million requests a day are going to hit the IO layer,
and about 125 million of those end up calling the CPU layer. Now, on to technologies. IO and CPU are both Rails, actually the same code but running on different servers, for reasons we'll see in a bit, and S3Get is Go. So basically we have the CDN to handle most of the traffic; out of what's left, a little tiny Go app handles most of that traffic; and then Rails is dealing with a hundred and fifty million requests a day, so about 1% of the number I said earlier. So my apologies to
the other companies that were on the slide: we're not technically that much bigger than them if you just look at Rails. But if you do look at how much traffic our app is ultimately serving, how many requests our Rails app is the source of truth for, it is fifteen billion. So that's how I justify that number. All right, let's talk about the CDN. The truth is we clearly need it, right? If you're going to be a company that's involved in serving media, you need to be working with some kind of a CDN. But there are companies that have chosen to build their own
CDN. There are companies who choose to buy, and we have gone the route of buying. We are paying a few different CDN partners, mainly Akamai and Fastly, to handle that traffic for us. And so the big benefit is that 95% of our traffic isn't our problem. Now, to be clear, this is not the heaviest part of our traffic, right? There's no processing on all this traffic. But there are certain challenges that come with that kind of scale, and we just don't have to deal with them. So that's great. There's also the best-in-class service and features that our CDN partners give us that we
didn't have to build ourselves, which is really nice. On top of that, there is a real benefit for reliability, because we have this multi-CDN feature, where we can basically switch between which CDN we're using to serve requests. That means that if one CDN partner, for whatever reason, is having downtime (which is an extremely rare event, as far as I can recall), we can fail over to the other CDN. So there are a lot of benefits here of buying rather than building
your own. The downside is that we need to play by their rules. When you're paying for a service, the service has certain rules that you need to work with. So we have this feature called f_auto, where rather than you having to decide in the URL what format the image is going to be, you just write f_auto, and then we'll say: well, given a certain browser version, whatever it is, we're going to serve the image, but in a different format,
like WebP. We can't handle that on Cloudinary's servers, right? We have to handle it at the level of the CDN, because we don't want the CDN to have to talk to us every single time. So, and this is just one example, we have tons and tons of custom rules that we have to write that are going to live on the CDN, and we have to write them in the language that they give us. And since different CDNs have different languages, we're maintaining a lot of the same logic in
different languages. Also, every provider has its own invalidation system. When someone, let's say, deletes an asset inside of Cloudinary, we then have to go ahead and invalidate it at the level of the CDN; otherwise, the CDN will basically keep responding with that image until the cache expires. So we have to actually go in and invalidate that cache. Each provider has its own rules about that, and we need to obey their limits, and it's actually a non-trivial thing. And I work on the billing team, so I'm very familiar with this part: we need to parse their log
files to bill customers. We can't just look at our own requests and check our own logs and say, "Hey, how much bandwidth did this customer use?" We actually have to look at the log files from the CDNs, in whatever format they're able to give us, and then figure out which customer a particular image belongs to, because those requests never hit our servers. So we need to do it that way. So there are certain challenges, but on the whole it's still very, very much a good thing that we are not
building it. In my opinion, you pay for great systems that do a great job. Next, S3Get. As I said, it's a very simple service written in Go. It's just a couple hundred lines of code; it's really not much. It is able to handle 85% of the requests it receives, and compared with IO, which is a Rails app, it takes up about 10% of the computing resources. So if you think about your bang for your buck, how many requests you can respond to for the amount of computing power, it's like two orders of
magnitude better than Rails. So it's super fast, which is great, and it really was very, very little code, and it is a really good way of handling a lot of traffic without having to have massively more servers to deal with it. Instead, we can just have this very, very lightweight system that's going to handle most of the traffic and let Rails focus on what's more interesting. The only downside really is that we need to duplicate some Ruby logic in Go, and when the Ruby logic changes, the Go logic has to change, or vice versa. There are times
when the libraries in the two languages were supposed to work more or less the same and they didn't quite, so we had a few edge-case bugs to fix there. But on the whole it's been a really great system that's helped us scale without growing too much or too quickly in terms of our hardware, to the point where it's not worth it anymore. There are some other advantages of layers just in general. Layers can scale independently. Again, I'm not going to go into the details of autoscaling, but if you think about the ability to expand or contract any layer without touching the other layers, that's really great. So
just to give a very simple use case: let's say there are tons of requests coming in. That's a problem of scale, but we don't ever even have to think about it, because S3Get is handling that huge influx of traffic. So S3Get can scale out while everything else inside of Cloudinary stays the same size. And that's true at any layer, right? Maybe there are a lot of very lightweight requests, or a lot of heavy processing, so IO could scale and CPU would stay the same, things like that. And actually, we're also slicing
things in the other direction as well: we have separate image and video clusters, and that makes it a lot easier to be cost-effective. The other element is security. The CPU layer is very, very boxed in. It basically gets what it's given, does some computation, and spits the result out. It can't talk to the internet, and it cannot talk to the database. It's running a lot of different tools, open source tools actually, in terms of the
image processing, and it's not able to access the internet or the database because it has a very large attack surface, right? With any of those open source libraries, maybe there's some vulnerability, and there's an opportunity to take advantage of that. So we make sure to lock down the CPU layer, because it carries a whole lot more risk. With the IO layer, because it's separated, we can be a little bit more free and open, since it doesn't have as many dangerous dependencies. So those are the advantages of layers. Next up is
sharding, which I'm sure many of you Rails 6.1 enthusiasts are all looking forward to. And I want to temper your enthusiasm a bit with this lovely quote from Terry Tempest Williams: "Shards of glass can cut and wound or magnify a vision." I think this is very, very true of sharding a database. Sharding a database both cuts and wounds, and it makes life very difficult in many ways, but it's also necessary in order to magnify the vision of your application, to enable it to scale out. So what even is sharding? There are probably
some of you who have heard of it but never really looked into it. So it's actually a subset of the general concept of partitioning. The more basic kind of partitioning is vertical, which is not called sharding, and it means that each database contains different tables. So you have one database on the left with the red and blue tables, and then we have the database on the right with the light blue and yellow tables. If you don't perceive color, that's fine; just trust me that these are different colors. And the idea
of vertical partitioning is to increase the ability to simultaneously read and write. The tables are the same size wherever they are, but there's less load on each individual database. This was already supported in Rails 6.0. Then there's horizontal partitioning, which is also called sharding, and this is a whole different level of complicated. The idea is that you have one table which is distributed across multiple databases. Okay, so we can see that each of these databases contains the same information, or the same tables rather, but different rows within
those tables. The goal of horizontal partitioning is to keep your table size under control. So essentially, imagine you have a table that's really, really incredibly long, with who knows how many billions of rows. Querying it can get very, very hard. Even if your indices are properly set up, it's still very hard, because your indices get so big that it takes time to search your indices. And so the solution is basically to split the table across multiple databases. The challenge is that you then have to think about all these situations. Like, how do
you join against a table that's on multiple databases? That's actually a really hard problem. And how do you, by the way, decide which record goes into which database? These are all big questions that come up when you shard. So, this is supported in Rails 6.1, but let's talk about how we do things on our end and how we solve some of these challenges. We have one main shard. This is again in the general theme of vertical partitioning: certain tables only live on that main shard. And then basically, when you sign up, you get what's called a
cloud. It's kind of your own independent workspace. You can potentially have multiple clouds, especially if you're a higher-paying customer, and basically the cloud is sort of its own independent entity in its own world. You can have different assets with the same name, as long as they live on different clouds, and that's fine. Now, on to the shards. Every cloud lives on one of several shards, and that cloud is 100% on that shard, not on the other shards. Okay? So if it's a shard 1 cloud, then everything about
that cloud is on shard 1. If it's on shard 3, everything about that cloud is on shard 3. All right, so essentially, once we know which cloud a request is for, the server will talk to the primary database, the primary database will say which shard that cloud is on, and we can work with, say, shard 3 exclusively. So that's actually really nice. It's a pretty simple way to solve the problem of how you split things up between the different shards: by cloud. We've never had a problem where a cloud got so big that it needed to be split across multiple shards; that
will probably be the follow-up talk, maybe a couple of years down the road. So, in terms of what it looks like in our code, we have this on_shard helper, where you basically find the cloud the request belongs to, you call on_shard on that cloud, and then you've got a block where you're doing your work against the shard that's been specified. It's littered everywhere in our code. This is not an exaggeration; this is a grep in the codebase, and it's thousands
of references. And by the way, you have to do this in your tests too; otherwise your test and production environments might not really match up. So it is not trivial to work with shards. Definitely not trivial at all. The pro is that it scales. At a certain point, if you want your app to have any sort of reasonable performance, sharding is not optional. It's not just a nice thing; it's absolutely necessary. And the other really major advantage is flexibility. So we have the ability to say, for example, that there's a customer who is having big spikes in traffic.
Maybe let's put them on their own shard for a while, or put them on a shard with a few other customers with similar kinds of traffic. That way, hopefully, they will not interfere with each other, or if they do, at least they're not hurting anyone else as much. The big downside is that it's error-prone.
So take a look at this bit of code. We get a request, and we're going to have this assets variable. Inside the shard block, we assign it any asset in this cloud that has this "duck" tag associated with it, and then we render that as JSON. So think of it as an API call; this is not a delivery call. Now, you'll note that the where call returns a relation, and it's only when we call render json: assets on the last line that we actually convert it to an array, thereby querying. And so the query happens outside of the shard block, which means either it'll try running against the primary database or it's not going to work at all. Basically, Active Record is really helpful and makes these chainable relations, but it doesn't actually evaluate them. So when working with shards, we need to make sure that the actual evaluation happens while we're on the correct shard. Now, the fix is pretty simple: just add a to_a to that assets = line, so that it loads eagerly rather than waiting until we've left the shard block. But it is something that happens.
Now, you might think Rails takes care of this for you. Well, I've got some news for you. Here's a similar kind of example. You have an ActiveRecord::Base.connected_to block, and you specify the role and the shard that you're speaking to. When you call Person.all, remember that all returns a relation. What's going to happen is that when you use people afterwards, it's already been loaded: as soon as you exit the connected_to block, Rails loads that relation. However, it's not hard to break that, because it only evaluates whatever is the return value of the block. Here we're assigning this people local variable, which by itself doesn't really matter. But if inside the connected_to block we call people = Person.all and then go on to do something else, people is not going to be the return value, and it turns out Rails is going to try to load people from the default shard.
So Active Record in Rails 6.1, with sharding support, does try to solve this for you, but it can't cover every case. And I guarantee you, absolutely guarantee you: if you do shard, you will have this problem sometime, sooner or later. I don't know when, but it will absolutely happen. And it's going to be a fun event to be debugging. Right, right.
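To make the pitfall concrete, here is a minimal sketch in plain Ruby. It does not use Active Record at all: FakeShards and LazyRelation are made-up stand-ins, assumed purely for illustration, that mimic a shard-switching block and a relation that only runs its query when it is converted to an array. What it shows is the behavior described above: a relation built inside the block but evaluated after it runs against whatever shard is current at evaluation time.

```ruby
# Illustrative stand-in for "which shard am I on right now?".
# This is not ActiveRecord; every name here is hypothetical.
module FakeShards
  @current = :primary

  class << self
    attr_reader :current

    # Switch the current shard for the duration of the block only.
    def on_shard(name)
      previous = @current
      @current = name
      yield
    ensure
      @current = previous
    end
  end
end

# Like a relation: building it runs no query.
# The "query" only runs when to_a is called.
class LazyRelation
  def initialize(table)
    @table = table
  end

  def to_a
    # Report which shard the query actually ran against.
    ["#{@table} rows from #{FakeShards.current}"]
  end
end

# Buggy: the relation leaves the block unevaluated.
lazy = FakeShards.on_shard(:shard_7) { LazyRelation.new("assets") }
buggy_result = lazy.to_a # evaluated after the block has exited

# Fixed: force evaluation while still on the intended shard.
fixed_result = FakeShards.on_shard(:shard_7) { LazyRelation.new("assets").to_a }

puts buggy_result.inspect # ["assets rows from primary"]
puts fixed_result.inspect # ["assets rows from shard_7"]
```

In a real Rails app the equivalent of the fixed line is the to_a (or load) call made inside the shard block, so that nothing lazy escapes it.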
That's all I can say if you do shard. But don't shard until you really need to. Try to push the limits of your database before you shard, because it is a very error-prone thing and it's really hard to do right.
Location. Or, as every real estate agent ever says: location, location, location. So let's talk about regions at Cloudinary: how we put servers closer to the people that they are serving. At Cloudinary, we have three regions. There's the US, but we also have the European and Asia-Pacific regions for those customers for whom that's relevant: a premium customer who really wants their service to be happening as close as possible to the European or Asia-Pacific regions. We have dedicated shards per region, so that solves most of the problem: with dedicated servers and shards, most of the issue goes away. But then there's a question: what about the primary database? The primary database is what you need to talk to in order to figure out which shard to talk to. So we had a few options for how we could have dealt with this. One option is to run three completely independent systems. That would be very painful, and we didn't want to do that. Another option is to have every region talk to the primary database in the US, but that's slow, because you have to go pretty far to talk to the database, from Europe all the way to the US region. We didn't want that either; that would not be great. Another option would be to have some kind of multi-primary database, where you have a primary in the US, a primary in the EU, and a primary in Asia-Pacific, and they would basically sync with each other all the time. That itself would have
been a certain degree of ops challenge. But more importantly, when the regions were being set up initially, I think we looked at the technologies that were available and decided it was just not a good fit for our needs. And so what we did is not exactly real time, but it's close enough. Here's how it works.
In the US region, we have, as before, our server and our database. And actually there is a Redis instance that stands in the middle for some of our frequently accessed models. So when you want to get some information, first you check: does Redis have it? If yes, great, we can move on. If not, we then talk to the database and look up that model; let's say it was a cloud. Now let's take as an example the European region; it could just as well be Asia-Pacific. Again we have a server and Redis, and we can talk to Redis, but there's no primary database here that we can talk to. So instead it will talk to a server in the US, get the information it needs, and then cache it in Redis.
Now, when things have to be updated: let's say we updated information about a cloud inside of the database. We have a separate thing called the syncer, whose job is basically to speak to the primary region, the US, to one of the servers there, and get the information about what's been updated since the last time we checked for updates, and then put that into Redis. Generally that happens within 10 seconds of any kind of update, so it's pretty quick. I mean, it seems like a pretty sure system. Like, what could possibly go wrong?
So on August 4th, 2019, we had a little bit of a situation: a kind of across-the-board increase in error rate. Now, to be clear, a big increase doesn't actually mean that all the requests were failing. A higher rate is, I don't know, something like six hundred errors in five minutes. It's not actually very high necessarily, but it was a significant increase. And so we had a status report on what happened. It was just a regular deploy. Everything seemed fine, and then the error rate spiked. For about 15 minutes we had this high, again comparatively speaking, error rate, and the cause was a problematic migration. Inside the migration, we had something that iterated over each of the clouds, updated some flags, and then saved. And when you hit save, you update the timestamp on the cloud, which meant that the syncer would identify this as a model to update. And with our very high number of clouds, there was a huge amount of information to carry over to the other regions, happening basically per server over there, and it just kind of gummed up the works. Everything was incredibly delayed. Nothing we could do. Usually when you have a problem, you increase the number of servers, no problem. Here, more servers would just mean more problems.
So what do we do long-term to stop this from happening? The first thing is, every time there's a code change, we check whether it's going to update clouds in some large number. And if so, we need to find another way of accomplishing this goal. It is part of our code review process now
as kind of the immediate solution. But long-term, what we actually did was this: we've been migrating to a new and improved system that does not have the same risk.
All right. So, location: the pro is that multi-region is really great for our customers and for their users, who get much faster service. The downside is that it is hard to do right. It's hard to sync your system in a way that is safe and reliable and also fast.
Okay, let's move on to deduplicating work. And I love this quote from F. Scott Fitzgerald.
There are all kinds of love in this world, but never the same love twice. And with Cloudinary's systems, we have a lot of love for our assets, our images and videos, but we never want to love them the same way more than one time. Why might that happen? Here's the challenge. We have the Ruby Shoes online store. They're releasing their new model, and it's the biggest new release of shoes in the last decade. So tons of people are really, really excited about this, and they want to see the site. They want to get the information, but there's a block up. Basically there's, like, a flag or whatever it is. Then they drop the block, tons of requests come pouring in, and of course that includes requests for the pictures of the shoe, and all that traffic gets passed on to Cloudinary. So you have huge numbers of requests for the exact same URLs at the same time. That's really the case that we need to deal with.
And so we have a system for locking, and the goals are as follows. Number one: we never want to repeat a transformation, ideally. But more importantly, we never want to block a job and not do it at all. Again, as much as we can, we never repeat something. The implementation of this is what I like to call a best-effort locking system: we try not to have work done twice, but if we really have to make a trade-off, we prefer doing something twice over doing it no times at all.
All right. What we use for this is something we internally call Loptr. Loptr is a character in Norse mythology. You may know him by his more common name of Loki, and Loki sounds like "lock." I didn't make up these names; please don't blame me. But that's what the locking system is called. So it sits right here. Basically, it's talked to by the IO layer, and it is written in Scala.
The way that it works is something like this. When you want to work on an asset, first you request a read lock, which means: don't write anything to this asset, don't mess with it. Then we create a record for the derivation of the asset, the transformed version, and we have to acquire a write lock before we actually go ahead and generate the transformation. A write lock means whatever process holds the write lock is allowed to write, but it's exclusive: no one else can use it at the same time. What this means is that multiple processes can read the same asset at the same time, but only one can generate a given transformation at a time. It all depends on the clients: nothing's actually locked in any formal sense. What it really means is just that there's a sort of understanding between the different processes that if I hold the lock, you can't
touch it; if you hold the lock, I will not touch it. So all the obligation of respecting the lock is entirely on the client. For speed, we use an in-memory lock table, and it works great. It's a pretty simple system; nothing too complex there.
There are some concerns that it brings up. First of all, failure to release. What if one process starts working on a transformation and then, say, the server just gets the plug pulled on it, and the lock never gets released? We always want to do it twice rather than no times; that's the general principle. And so we have a timeout on locks. Essentially, if you've been working more than X number of seconds, then the next process in line will get a chance. So with a very long-running transformation, you might basically have it done a few times, as one process after another after another times out, though at some point we're not going to do it more than a certain number of times. Or it might be that the first process finished, so whoever is next is going to be like, "Oh wait, the transformation finished, so I don't actually have to do the transformation anymore."
Another concern is downtime. What if your locking system is down for a minute? The answer is sort of funny: we pretend every lock request succeeded, again following the principle that it's better to do it twice than no times at all. This would not work for something you need for data consistency, for example, but it works great for something that you really just need for performance, mostly trying to avoid duplicate work.
How do you scale your locking system? We have a cluster of lock servers. Again, this is about having a very intelligent client. The client hashes the information, basically "here's the asset I'm requesting," using a consistent hash, and that decides which server it's going to talk to within the locking cluster. So the logic is on the client, which keeps the operations pretty lightweight.
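The three ideas above (a lock timeout, treating an unreachable lock server as a granted lock, and client-side consistent hashing to pick a server) can be sketched together in a few lines of Ruby. This is only an illustration under stated assumptions: the real system is described as written in Scala and has separate read and write locks, whereas this sketch has a single exclusive lock, and LockServer, LockClient, the 30-second timeout and the asset key are all hypothetical names and values.

```ruby
require "digest"

# Stands in for one lock server: an in-memory table with a timeout.
class LockServer
  attr_writer :down

  def initialize(timeout:)
    @timeout = timeout
    @locks = {} # key => [owner, acquired_at]
    @down = false
  end

  def acquire(key, owner, now:)
    raise IOError, "lock server unavailable" if @down

    holder, acquired_at = @locks[key]
    # Grant if free, already ours, or the holder exceeded the timeout.
    if holder.nil? || holder == owner || now - acquired_at >= @timeout
      @locks[key] = [owner, now]
      true
    else
      false
    end
  end
end

# The "intelligent client": picks a server with a hash ring and
# fails open when the chosen server cannot be reached.
class LockClient
  def initialize(servers, points_per_server: 16)
    # Several ring points per server, sorted by hash value.
    @ring = servers.flat_map do |name, server|
      (0...points_per_server).map { |i| [hash_of("#{name}-#{i}"), server] }
    end.sort_by(&:first)
  end

  def server_for(key)
    h = hash_of(key)
    # First ring point at or after the key's hash, wrapping around.
    point = @ring.bsearch { |p| p.first >= h } || @ring.first
    point.last
  end

  def acquire(key, owner, now: Time.now.to_f)
    server_for(key).acquire(key, owner, now: now)
  rescue IOError
    true # better to do the work twice than not at all
  end

  private

  def hash_of(str)
    Digest::MD5.hexdigest(str)[0, 8].to_i(16)
  end
end

servers = {
  "lock-a" => LockServer.new(timeout: 30),
  "lock-b" => LockServer.new(timeout: 30)
}
client = LockClient.new(servers)
key = "demo/shoe.jpg/w_300"

first  = client.acquire(key, "worker-1", now: 0)  # lock is free
second = client.acquire(key, "worker-2", now: 10) # held by worker-1
third  = client.acquire(key, "worker-2", now: 45) # worker-1 timed out
servers.each_value { |s| s.down = true }
during_outage = client.acquire(key, "worker-3", now: 46) # fail open

puts [first, second, third, during_outage].inspect # [true, false, true, true]
```

Failing open is only reasonable because the lock protects against duplicated work, not against inconsistent data, which is the trade-off described above.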
The big pro is resiliency to traffic surges, just like the one I described for Ruby Shoes, or Ruby Shoe, however you want to pronounce it. It's really great, and that's the kind of thing we see all the time, right? A news article just opened... I mean, there's lots of things that go on the internet, and immediately there's tons of traffic asking for that specific thing. The downside is that if there are too many processes in line for a lock, and the locks don't get released fast enough, there can be a delay. It doesn't happen that frequently, though, so honestly it's not a really big con. More importantly, it's not a hundred percent reliable. It's not on the level of consistency. It's a good-enough system. Not perfect, but that's not the goal, right? Because we'd rather do the same work twice than not at all.
All right, not scaling. This is, like, the coolest part, so I'm really excited. This is about what we do when we choose to not do something that we've been asked to do, or at least don't do it right away. The general principle is that we always want to limit the impact of an individual customer on the entire system. Because, you know, what would you rather deal with: one dissatisfied customer, or thousands of dissatisfied customers? When a customer spikes their usage, it just takes a certain amount of time to respond to it, to scale up your computing resources. It's much easier to talk to one dissatisfied customer, to explain the situation, to help them correct the problem, to help them figure out how not to use the system in that way, or to say, "Hey, can you wait five minutes, and then we'll do the spike?" Most of the time, our system is able to handle spikes; we have enough give built in there. But we do recognize there are times when someone is going to have some kind of usage that is going to cause a problem. So we need to be able to create a situation where maybe a few customers here and there will have a bigger incidence of downtime or increased latency, but it's not going to impact everybody. That's the idea,
right? Limits on individual customers if they hit a certain amount. Then there's scaling up: eliminating scarcity. Eliminating scarcity means scaling up our computing resources, but it takes a certain amount of time to do that. Even if you have it perfectly tuned, autoscaling on AWS takes a minimum of, like, one minute to begin to respond to any kind of scaling event; that's their minimum latency to autoscale, just as an example. So we have to manage scarce resources when there's a lot of demand for them, and we use a few interesting techniques to do that. And of course we also like to use background jobs, which help us spread out the impact of a spike event.
Rate limits: some API calls have a strict limit. And then for others there's, again, an internal thing where we have locking. It's not actually a rate limit per se, so those API calls are not strictly rate limited, but at a certain point you will hit this throttling from locking. That usually means we get a call to customer success, and then we can mediate through the appropriate channels. It doesn't come up for most of our customers at the scale that they're dealing with.
Now, in terms of the fair queuing system. The idea is basically that every job is assigned a number of slots, and we have a queue-of-queues mechanism to allot those slots to clouds. That's a lot of things, so I want to do this graphically. Imagine that we have a bunch of requests that came in from different customers, each with a different color and glyph. There's no significance to which glyph it is; it's just a way of making sure that even if there are color-blind people watching the talk, they're going to be able to understand what's going on here. Don't ascribe too much importance to which glyph I chose for each color. So a bunch of requests came in, and we could just handle them in the order they came in. That's one way that would work. But that gives an unfair advantage to whoever makes a spike. Instead, what we do is create a queue of queues: each customer gets their own queue, we put all those queues in one big queue, and then we can start processing them in this
fair way. We start at the top: we process the first customer's job, go ahead to the next customer's job, and so on, and then we go back to the top, and back to the top again, and just keep going all the way down the line until we're done.
I want to take a step toward a little bit more complexity here, because it's not like there's only one job in process at a time on a machine. Imagine, again taking a very simple way of looking at it, that the queuing system is deciding the order, and the machine just takes whatever the next job in the queue is, as you go, until we're done. That works if all the jobs are the same size. In reality, they're not: some jobs are bigger than others. So what we do is assign a value to each job, which again we call a number of slots, and we have a theoretical maximum number of slots that we can effectively handle at a time on the machine. So we go to the top and take the first job. Our next job requires three.
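As a rough illustration of the mechanism up to this point, here is a small Ruby sketch of a queue of queues with slot accounting. It is an assumption-laden toy, not the production fair queue (which is described as implemented in Go): FairQueue, the customer names, the job names and the slot numbers are all invented for the example, and it covers only round-robin turns and a machine-wide slot maximum, with no further per-customer rules.

```ruby
# Round-robin "queue of queues": one FIFO per customer plus a
# rotation of customers. All names and numbers are hypothetical.
class FairQueue
  Job = Struct.new(:customer, :name, :slots)

  def initialize(max_slots:)
    @max_slots = max_slots
    @used_slots = 0
    @queues = {}   # customer => pending jobs, oldest first
    @rotation = [] # customers with pending jobs, in turn order
  end

  def push(customer, name, slots)
    @rotation << customer unless @queues.key?(customer)
    (@queues[customer] ||= []) << Job.new(customer, name, slots)
  end

  # Start the job whose turn it is if it fits in the free slots.
  # Otherwise return nil: the caller waits until finish frees slots.
  def start_next
    customer = @rotation.first
    return nil if customer.nil?

    job = @queues[customer].first
    return nil if @used_slots + job.slots > @max_slots

    @queues[customer].shift
    @rotation.shift
    if @queues[customer].empty?
      @queues.delete(customer)
    else
      @rotation << customer # back of the line for the next turn
    end
    @used_slots += job.slots
    job
  end

  def finish(job)
    @used_slots -= job.slots
  end
end

queue = FairQueue.new(max_slots: 10)
queue.push("red", "r1", 2)
queue.push("red", "r2", 2)
queue.push("red", "r3", 2)
queue.push("blue", "b1", 3)
queue.push("green", "g1", 4)

started = []
while (job = queue.start_next)
  started << job
end

puts started.map(&:name).inspect # ["r1", "b1", "g1"]
```

Even though the red customer queued three jobs first, blue and green each get a turn before red's second job, and that second job then has to wait because only one of the ten slots is still free.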
Okay, two for the next one. And then we say: ah, we don't have enough slots available. We need four slots for this job to run, so we wait for some slots to free up before we can move on. We're just going to wait till some slots free up, and then fill them in with this job. Back to the top again, same thing. And now we have an interesting situation. This yellow customer, the envelope customer, already has two slots they're taking up, and they're asking for three more. We actually have a limit: generally, a customer can only have up to 50% of the available slots total associated with them, and 2 + 3 is 5. They'd have to finish up one of their previous jobs first. So we go on to the customer on top. That's fine; the envelope customer is just going to wait until one of their jobs finishes. But actually the customer on top is now also asking for two slots and already has three, so they're also asking for too many. And we're basically bouncing back and forth until such time as one of the customers finishes their jobs, and then we can fill in some more slots. Now, it might be a little bit confusing as
to why we're insisting that you can't fill too many slots for one customer, but we'll see exactly why it makes a lot of sense. So we go to the top again. The customer at the bottom, the blue customer, has another job that requires four slots. So we're actually going to wait for them. We're going to wait until some more capacity gets freed up, and then they're going to get first dibs on the slots. So it creates a situation where we are reserving some capacity for other customers at all times. Even though we're not fully using all our capacity all the time, it makes for a fair experience for all the other customers, since there's still capacity to take in their traffic. And then we can go ahead and continue under the rules we established earlier.
Okay. The fair queue lives between the IO layer and the CPU layer, and it is implemented in Go. So that's some more Go in our system. There's one more element to this whole thing, which is that we prefer synchronous to asynchronous requests. If it's asynchronous, in the background, we can try it later. It doesn't
really need to be done right now. Usually, again, everything makes it through the queue basically immediately; we have enough capacity for that. But in a situation where there is a big spike, then the fair queue comes into play, and we say, "Okay, asynchronous requests, you've got to back off for now while we scale up." Okay. We also have, by the way, a similar system for fair database access. So we have, again, a queue of queues, not with the number of slots, nothing as complicated as that, but a queue of queues to make sure that
each cloud can't access the database too much at the same time. How is this done? It's one gigantic monkey patch that actually ends up queuing the clouds in this fair way. Okay, that's more or less the main one.
Background jobs. The principle is that anything that can be done out of band should be done out of band. So, for example, CDN invalidations can definitely be done in a background job. No reason to do that as part of the request. Webhooks that happen as a result of uploads or
other operations can also happen in the background. And we have eager transformations. This is kind of the big one in terms of delivery flow: the processing part doesn't actually have to happen when you first expose your site's users to your image. It can also happen in advance. You can just tell us, "I'm going to need this," and we generate it, and then it's just there. It's a better experience for users, and it definitely helps our own systems' load. So, human
factors. This is the last bit, and there's a hilarious saying: the good news about computers is that they do what you tell them to do. The bad news is that they do what you tell them to do. And the idea is that we can scale via humans, if we help humans tell the computer more scalable things to do. The first prong of our basically two-pronged approach is education. We encourage practices like eager generation, which are good for our systems but also good for their users. And then there's the element of relationships: cultivating the kind of relationships with our customers where we understand their use cases
and how they might use our system and what kind of stresses that might place on the system. We create a situation where, if they're going to make a big change in usage patterns and they want us to be ready, they just let us know about it. And we're happy to scale up in advance and make sure that we're ready for that kind of spike. Ultimately, this is all about looking for ways that we can help the customer while the customer helps us, and everyone has a better outcome from that. Say you have a really big customer and they're going to do a big spike.
If they tell us, everybody wins, right? All our customers get a better experience. The customer has the capacity ready for when they need to make their change. And we're able to not have PagerDuty wake us up in the middle of the night.
Let's finish up by talking about Cloudinary on Rails. I want to point out that this talk was not really a Rails talk that much. Most of our traffic never touches Rails; really only about 1% does. The fastest parts of the system are not written in Ruby; they're in Go or Scala. The computation-heavy parts are low-level
utilities or APIs. In certain cases, we have more AI-focused groups in the company that have certain APIs that we also use for transformations. The database scaling is language-independent: we had to make some changes to Active Record, but the strategy itself does not really depend on the language. All this is to say that the challenges of scale are challenges of scale. They're not Rails problems per se. Rails is actually really, really great. It's been a great way to build a system. For about the first four years after the company
was founded, it was mainly worked on by two developers, and they already developed the core features of the system during that time, and it's been working for us since then. Ruby is really great for creating interfaces, whether it's interfacing with APIs, with shell commands, with low-level utilities, whatever it is. It's a really good language for that. We did need to move a few things out of Ruby, but this is not one of those stories where we rewrote our whole app in Go to solve the problem. We moved the most performance-sensitive parts.
Where there was a big win to be had by moving out of Ruby, we did that. Where it wasn't a problem, it could stay in Ruby. We don't have to be dogmatic about it, but we also didn't need to leave Ruby. It's been a great system. However, there have been certain challenges. Namely, upgrading Rails is always hard. There's a reason that every single conference has a couple of talks about upgrading Rails and the challenges and how they overcame them. If you monkey patch deep into the internals of
Rails, and especially Active Record, it gets even harder. And there's one challenge specific to Israel: the primary Cloudinary R&D center is in Israel, and it is actually pretty difficult to recruit Ruby developers here. There are not a lot of ways that Ruby developers get trained. Some get migrated over, but we don't have the abundance that you have in a world of boot camps and things like that. And developers in Israel don't always want to learn Ruby; it's not necessarily a language that they think is going to get them
their next job. And so they might not want to take a job in Ruby at all. At the end of the day, we have tons and tons invested in Ruby on Rails, and we enjoy, I think, working with Rails, but we are moving in the direction of breaking bits of the big Rails monolith into polyglot microservices. And the goal of microservices is not so much that they're going to be written in the best tool for the job, so to speak; really, it's more of a human thing. It's about building the app that our next employee wants to work on.
And if Rails can't do that, then, you know what, unfortunately it's not going to be that, however much we love Rails. So that's that. At the same time, again, this is not relevant to you if you are in a part of the world where Ruby and Rails are still very popular, and I get the sense that they are in many parts of the world. And also if you're on a globally distributed remote team, which are becoming more and more common these days. So it's not
actually going to be a challenge for you. And the main story in that case is that Rails is a really productive environment in which to work. We didn't have to move that much out of Ruby or anything like that; it wasn't necessary. And Rails really is potentially a great trail to scale without fail. That's what I have for you. Thank you so much to everyone who was part of this process and helped me with the talk. Thank you to you all for listening. And you can see it aside from the NICU. And of course you can write