Duration 41:15

RailsConf 2019 - Scalable Observability for Rails Applications by John Feminella

John Feminella
Co-Founder at UpHex
RailsConf 2019
April 30, 2019, Minneapolis, USA

About speaker

John Feminella
Co-Founder at UpHex

I'm an experienced technology leader on a variety of complex distributed systems, a change agent for process and cultural transformation at large enterprises, and a public speaker on technical topics I find interesting. I've led, managed, or been responsible for teams of varying sizes, from small bands of engineers to a technology division with over two dozen balanced teams of dedicated developers, PMs, and designers. I love being at the intersection of difficult business problems and technology solutions.

I've been invited to speak and share my expertise at over 100 talks in the last few years, on topics ranging from cloud-native platforms, to blockchain applications, to using machine learning for valuing stock options. I find it especially exciting to build things and share what I've learned with others.

The kinds of problems I most enjoy solving are difficult infrastructure or process problems, like unusually large or complex distributed systems, driving enterprise transformation to the cloud, or improving a company's software discipline.

Generally I'm very interested in:
  • building and leading efficient and capable engineering organizations,
  • helping enterprises transform how they deliver technology outcomes,
  • web services and architectures,
  • highly scalable systems and data,
  • creating and designing APIs, and
  • novel applications for blockchain technology


About the talk

RailsConf 2019 - Scalable Observability for Rails Applications by John Feminella


Do you measure and monitor your Rails applications by alerting on metrics like CPU, memory usage, and request latency? Do you get large numbers of false positives, and wish you had fewer useless messages cluttering up your inbox? What if there were a better way to monitor your Rails applications that resulted in far more signal — and much less noise?

In this talk, we'll share an approach that's effective at measuring and monitoring distributed applications and systems. Through the application of a few simple core principles and a little bit of mathematical elbow grease, firms we've tried this with have seen significant results. By the end, you'll have the tools to ensure your applications will be healthier, more observable, and a lot less work to monitor.

Transcript

Welcome to Scalable Observability, a handy guide for getting started with distributed applications. I'm John Feminella. I work for a company called Pivotal, and I'd be delighted to talk to you about it forever, but that's not the focus of this talk. If you do have questions, or comments, or questions that are more of a comment than a question, I'm happy to take any and all of those; feel free to tweet me, Slack me, or ask me afterwards.

In the talk today I want to cover the fundamental building blocks of observability as applied to applications: the kinds of applications you all probably design, build, test, and operate. When people hear the word "observability" it can be a confusing term, a new term to some, or one that gets conflated with other things like monitoring and metrics. I want to use this talk as an opportunity to distill some of those ideas for you and to show how they can be applied to your own applications, whether you're running one, ten, a hundred, or a thousand different applications, and whether all of them, some of them, or none of them are Rails. I want to show you some tools that can help address the challenges we'll talk about, and I want to recommend a specific approach that I think will be helpful. If this is a new topic for you and you're just getting started with observability in your portfolio, I'll also offer some specific advice that I think will be useful on that journey.

Okay, so let's talk about those challenges first. When people look at an application they're responsible for operating, they want answers to questions like: Is this healthy? Is it meeting all the promises it's supposed to meet? Am I processing people's concert ticket requests? Am I making widgets in the factory? Am I doing all the things this application or service is supposed to do? It can be tempting to try to answer that question by measuring everything, measuring as many things as you possibly can.

To take one specific example and put some numbers on it, imagine you have a system whose health you'd like to understand. There are lots of things we could measure, and people might start off with things like how much CPU is being used, or how much memory. Let's say that for this particular application we record a hundred different things for every instance. Call each of those recorded time series a stream: every instance of this application has a hundred streams. Each stream is sampled once per second, and each sample takes 24 bytes, enough for, say, a 64-bit timestamp, a 64-bit number, and some 64-bit identifier or other metadata. That's 2,400 bytes per second per instance of the application, or 240,000 bytes per second if you're running a hundred instances, which works out to about 7.5 terabytes per year. That seems like a lot of data; it's probably more than fits on your laptop. But if we store it in S3 or some other cloud service, it's not very expensive, relatively speaking: storing a year's worth of data for that application costs about two hundred bucks.
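To make that back-of-the-envelope arithmetic concrete, here is a quick sketch of the same numbers in Ruby; the sample size, stream count, and instance count are the example figures above, not measurements of a real system.

BYTES_PER_SAMPLE   = 24    # 64-bit timestamp + 64-bit value + 64-bit metadata
STREAMS_PER_APP    = 100   # distinct things recorded per instance
SAMPLES_PER_SECOND = 1
INSTANCES          = 100

per_instance = BYTES_PER_SAMPLE * STREAMS_PER_APP * SAMPLES_PER_SECOND
# => 2,400 bytes per second for one instance

fleet_per_second = per_instance * INSTANCES
# => 240,000 bytes per second across a hundred instances

seconds_per_year = 60 * 60 * 24 * 365
fleet_per_second * seconds_per_year / 1e12
# => ~7.6, i.e. roughly the 7.5 terabytes per year quoted in the talk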

So for enterprises and larger companies that seems like a no-brainer. Why wouldn't we want to store everything we possibly could, all the time, about everything, especially if there's a problem later on and we want to be able to diagnose how it came about? That's a valid question to ask, and the prevailing philosophy in a lot of these places is exactly that: if you can measure it, and it's cheap to measure, why wouldn't you measure everything you possibly could to understand how the system is functioning?

But it's worth asking whether we're measuring the right things. For comparison, imagine you have a Java application instead of Rails, and that Java application uses Spring Boot. Out of the box, a whole bunch of endpoints get populated just by virtue of using Spring and a particular starter. You can hit the metrics endpoint and say: here's a gauge in my application that measures the number of tickets I sold, I want to record that and emit it in some stream, and I'm going to do that for all the events I care about.

Rails isn't quite like that. The framework and the runtime don't give you much out of the box, and instead you turn to the various instrumentation and benchmarking facilities you can add to a Rails application. You get a few different choices. There's a fairly low-level tool, ActiveSupport instrumentation, which is about recording events that happen inside the Rails framework as it interacts with your application. There's agent-based tooling, things like New Relic, which you add as an agent or binary to the environment your application runs in. And there are libraries that get inserted into the Rack middleware stack of your application. So we have to agree on what we could measure, how we go about measuring it, and which tools we should use to do that.
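As a minimal sketch of the first of those options, ActiveSupport's notifications API lets you subscribe to events the framework already emits and instrument your own. The "tickets.sold" event, its payload, and the logging choice below are invented for illustration; "process_action.action_controller" and its payload keys are standard Rails ones.

require "active_support/notifications"

# Subscribe to the event Rails emits for every controller action.
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  Rails.logger.info(
    controller:  event.payload[:controller],
    action:      event.payload[:action],
    status:      event.payload[:status],
    duration_ms: event.duration.round(1)
  )
end

# Emit a custom event for something the business actually cares about.
ActiveSupport::Notifications.instrument("tickets.sold", quantity: 2, price_cents: 4_500) do
  # ... sell the tickets ...
end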

There's another problem, which is that we're often measuring the same things for applications that are not really the same as each other. As one data point: across the companies I've been working with over the past year or so, most of which are Fortune 500 companies, about 10% of the applications are Rails apps, roughly a hundred and thirty of them. 94% of those don't use custom metrics; they accept whatever the tooling measures for their application out of the box. And 92% of them haven't changed a setting since the app was first set up: they generate the configuration file for, say, New Relic, add it to the application, commit it to source control, maybe make a few tweaks, and never touch it again.

That's a problem, because while there's little variation in what we measure for different apps, there's lots of variation in the system topologies of those applications, in their footprint and cardinality (the number of instances we run), and in their dependency graphs. Maybe you have a monolith; maybe it depends on a whole bunch of microservices; maybe you have a huge service mesh with fifty different things all talking to each other. Those are very different relationships between applications. There's also lots of variation in the runtime environment: we move the thing from dev to staging to production, and maybe we have multiple versions running in production at once because we're in the middle of a blue-green deployment, for example. So we're measuring the same things, with the same defaults, for all of these applications, even though the applications are really, really different from each other. Does that make sense to do? And even worse, because we're all humans with meat brains, we're prone to all the wonderful biases that come along with that.

A very frequent mistake is to assume that a correlation, two things being related to each other, represents a causal relationship in which one thing causes the other. Correlation is a mathematical idea: if you graph two different things against each other, we say there's a positive correlation when, as one increases, the other increases in about the same proportion. A negative correlation means that as one increases, the other decreases in the same proportion. And if the two things don't seem to follow any relationship, so that as one increases the other may or may not increase or decrease, there's essentially no correlation.

Here's a specific correlation, between spelling test scores and shoe sizes. It turns out that if you measure people's shoe sizes and how well they score on a standardized spelling test, say asking them to spell forty words of at least seven letters and counting how many they get right, people with smaller shoe sizes are worse spellers. I'm a US size ten and a half, and I'm an okay speller, I'd say; so how predictive is my shoe size of where I'd fall on that graph? It turns out it's quite predictive. Does your shoe size determine your spelling ability, or vice versa? No, because there's a missing variable here, which is age. As people get older they learn more words and can spell better. A toddler has a very small shoe size and can't spell many words; a thirty-year-old can spell far more words than a toddler can, and has presumably grown a few shoe sizes since then. None of that is indicated or represented by the data, and yet this is a very high correlation, roughly 0.79 between those two variables.
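For reference, the figure being quoted is a Pearson correlation coefficient. Here is a small, generic sketch of how it's computed; the sample arrays are made up purely for illustration.

# Pearson correlation: +1 is perfect positive correlation, -1 perfect
# negative, 0 no linear relationship at all.
def pearson(xs, ys)
  n      = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov    = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sd_x   = Math.sqrt(xs.sum { |x| (x - mean_x)**2 })
  sd_y   = Math.sqrt(ys.sum { |y| (y - mean_y)**2 })
  cov / (sd_x * sd_y)
end

shoe_sizes  = [1.0, 2.5, 4.0, 7.0, 9.0, 10.5]   # made-up sample values
test_scores = [2, 6, 11, 24, 30, 33]
pearson(shoe_sizes, test_scores)   # => close to 1.0 for this made-up sample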

So you can imagine all kinds of relationships where one thing isn't necessarily causing the other. Here's the per capita cheese consumption of Americans correlating with the number of people who died by becoming entangled in their bedsheets: the more cheese Americans consume, the more people die tangled in their bedsheets. Does that mean Americans are eating a bunch of cheese and then suffocating because they get entangled in their bedsheets? No; one is not caused by the other. It's just a coincidence that they happen to be related in that way.

Another mistake we make when thinking about metrics, and about how we represent and monitor our applications, is that asking a lot of "what if" questions leads to false discoveries. A common theme in the general scientific literature is people asking questions like: what does or doesn't make you lose weight? Does eating some particular food, some particular magic berry, give you superpowers with respect to your metabolism? Give people a whole bunch of different diets: does eating peaches make you lose weight? Does three hours of exercise make you lose weight? If you run enough tests, eventually you'll find something that coincidentally happens to buck the trend. But it's not because that thing is actually valuable; it's because you tested a lot of different things and one of them turned out to be coincidentally associated with healthier people. That's why so many of those studies are later refuted: when people try to replicate the results, it doesn't work, because there was never a real effect to replicate.

Finally, the last kind of issue is that when we represent these really complex applications, with really complex topologies and really complex dependencies, we often summarize them with really simple numbers: how much CPU is that application using, how much memory, and so on, and then we extrapolate from that to some kind of understanding about whether the application is healthy, or meeting its promises, or whatever we'd like it to do. Consider a data set of a hundred random points spread over a 0-to-100 range on the x and y axes, along with some of its properties: the averages and standard deviations of x and y, and the correlation we talked about before. It's a fairly random data set, so it won't surprise you that the correlation is close to zero; the variables are not well correlated because they were randomly generated. Now take a different data set with the same number of points, arranged in the shape of a dinosaur. It has exactly the same summary properties as the first one. Those two things don't look anything like each other; even if you've never seen a dinosaur before, you can hopefully tell that the one on the left is much more ordered and structured than the one on the right, yet they share exactly the same statistics, rounded to two digits after the decimal place.

Because of that, it's very important to be able to understand the overall shape of things and not just rely on metrics that reduce everything to a single point or a single value. In fact, you can perturb that data set over time, adjusting only the digits beyond the last two after the decimal point, and wind up with data sets that look totally different from each other yet have exactly the same statistics as the dinosaur and the random cloud you saw before.

So what does it say about our applications when we measure things like this and expect to have a clear or broad understanding of what's happening? We're being overconfident; we're fallible humans with meat brains, prone to exactly these kinds of mistakes. Let's talk about some tools we can use to do better.

The first, if you aren't already familiar with the idea, is the service level objective: essentially a promise that your application or service makes about how it's going to interact with its users, upstream or downstream services, and so on. This is often broken into three smaller components: the SLI, the service level indicator; the SLO, the service level objective; and the SLA, the service level agreement.

The indicator is the fact you want to be true. I want my application to be serving a hundred requests per second; I want people to be able to complete a transaction in ten seconds or less; I want this message queue to process twenty jobs per minute; et cetera. Those are indicators. The service level objective is how strongly you want that promise to hold. If, for example, you want every request to finish in five seconds, and you want that for 99.9% of requests, then that 99.9% is the SLO; it's the target for the promise. If you adjust the SLO, you're adjusting the strength of the promise. If you adjust the SLI, you're adjusting what you're promising. The SLA is the consequences for breaking the promise: if you have an SLA with some service or some application, then there are consequences when the promise is broken. Without an SLA you're still making promises, but there aren't necessarily any consequences when a promise goes awry.

As an example, the ISP you get your broadband from may say: we promise we'll give you a hundred megabits per second downstream and upstream. That's the promise, the indicator. They promise to deliver it 99.9% of the time; that's the strength of the promise, the objective. But there's not usually an SLA for residential users. If your internet goes out, maybe calling customer service gets you a one-day refund; there are hardly any consequences for the promise being broken. Understanding the difference between those three is crucial.
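As a small illustration of how an indicator and objective pair might be checked in code, here is a hypothetical sketch. The 5-second indicator and 99.9% objective are the example numbers from the talk; the method name and data shape are invented.

# Hypothetical SLO check: "99.9% of requests finish within 5 seconds".
SLI_LATENCY_LIMIT = 5.0     # seconds: what we are promising (the indicator)
SLO_TARGET        = 0.999   # how strongly we promise it (the objective)

# latencies: observed request durations, in seconds.
def slo_met?(latencies)
  return true if latencies.empty?
  fast_enough = latencies.count { |seconds| seconds <= SLI_LATENCY_LIMIT }
  (fast_enough.to_f / latencies.size) >= SLO_TARGET
end

slo_met?([0.2, 0.4, 1.1, 0.3])   # => true
slo_met?([0.2, 9.0, 0.3, 0.4])   # => false (only 75% within the limit)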

It's useful to think about your applications and your portfolio in terms of what promises they are or are not making. What do you care about being true of your application, and what do you usually not care about? Take something like "it's using less than 90% of memory": you might care about that in some indirect sense, but it's not actually the thing you care about. What you care about is that it's serving patient medical records in a timely fashion, or letting people buy concert tickets, or helping people book flights, et cetera. That's really the shape of the thing we want to capture with our SLOs: not low-level details that would apply to almost any application, but the things that are specific to this application or service.

Now let's talk about observability itself. This is a bit of a nebulous term in the industry right now, but the way I define it is that observability is our ability to understand a system given only its outputs. Think about a printer. It's a fairly complex machine, I would say, compared to a screwdriver: it has a lot of moving parts, some software and hardware, and so on. Say you send a print job to the printer, lots of pages are coming out, and then the printer stops producing the thing you wanted. What's wrong? How do I get it to keep doing the thing I wanted? Maybe there's a screen that tells me I've run out of paper and have to load more, or that there's a jam and I have to go fix it. That's a very primitive kind of observability, but it gives me enough information to understand what's broken, how to fix it, and what I need to do next.

That's the objective we want for observability in our systems: can we understand when our promises are broken, and what we have to do to fix them, just by looking at the outputs of the system we've already set up? If the only way to do it is something like attaching the Ruby debugger, or byebug, to my running Rails application and figuring out why it blew up at that specific moment, that's not going to scale very well. When I add another application, when I try to serve a lot more requests, when I try to make a stronger promise, I'm going to run into trouble.

Crucially, the lifeblood of observability, the thing that actually lets us understand the degree to which our system is meeting its promises, is events. Events are the currency of how we want to talk about observability, and there are several tools in your observability toolbox that you can mix and match to get an understanding of whether or not you're meeting those SLOs.

When I say metrics, I mean measuring facts, and sometimes statistical aggregates of those facts, about the events that are happening and about the system overall. Metrics are good for a few things. First, it's almost always possible to learn something by adding a metric you didn't have before. Whether that new thing is useful or not depends very much on the metric: if you measure CPU and you weren't measuring it before, that tells you how busy the system is, but if your system is not CPU-constrained, that's not actually very useful. Most systems, and especially most Rails applications, tend to be network- and IO-constrained; they rely on some downstream service, or a database, or something else that has to be fast before the application can be fast. Another advantage is that if your metrics carry a set of keys, you can slice and dice the data you've collected to tell different stories as suits your purposes. For example, take information about all the tickets people are buying and how much they paid; those are events, and you can ask questions like: how many people paid over $10 for a ticket? Maybe my ticket prices are too high if not enough people are buying. How many people paid $0 for a ticket? That seems problematic; I might have an error somewhere in my shopping cart.
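Here is a hypothetical sketch of that kind of slicing, treating each purchase as an event with a few keys; the event shape and field names are invented for illustration.

# Each event is a fact about something that happened, with keys to slice on.
events = [
  { name: "ticket.purchased", price_cents: 1_200, venue: "north" },
  { name: "ticket.purchased", price_cents:     0, venue: "north" },
  { name: "ticket.purchased", price_cents: 4_500, venue: "south" }
]

over_ten_dollars = events.count { |e| e[:price_cents] > 1_000 }
# => 2; are prices too high? are enough people buying?

free_tickets = events.select { |e| e[:price_cents].zero? }
# => one event; probably a bug somewhere in the shopping cart

by_venue = events.group_by { |e| e[:venue] }
# => the same events, sliced a different way to tell a different story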

The downside of metrics is that it's hard to tell what's important before you're experiencing some kind of problem with your system. So people adopt a scar-tissue mentality: they have some kind of outage, say the application ran out of memory, and now there's a PagerDuty alert every time the application gets over 94% memory. It's hard to sustain that for every problem you ever have, and measurements on their own aren't robust enough to describe the overall picture of the system; you're looking at the summary numbers, not at the dinosaur.

Logging is a way of taking the events you have, structuring them so they're easily machine-readable and human-readable, and providing them as a stream. Even when it's not structured, a log line like "hey, I reached this point in the method" is a really cheap, easy way to emit information at the point that's interesting, and it makes it easy to debug specific issues: add a message that tells you what's going on at that point, and if you see the problem again, you know where to look.
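As a minimal sketch of what "structured, machine-readable and human-readable" can mean in practice, here is a hypothetical JSON log line emitted with Ruby's standard logger; the field names and values are invented.

require "logger"
require "json"

logger = Logger.new($stdout)

# One event, one structured line: easy for a person to read, easy to parse.
logger.info(JSON.generate(
  event:       "ticket.purchase_failed",
  order_id:    "ord_123",               # invented identifier
  price_cents: 0,
  reason:      "price computed as zero in cart"
))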

The same counterpoint we talked about before applies here too: capture everything, all the time, never throw any logs away, keep them going back a hundred years. For some kinds of enterprises and businesses that might genuinely matter, and there may be legal or regulatory requirements to retain logs, in which case this particular point doesn't apply. But for most applications, most of the time, it's not actually important to store all logs forever, until the end of time. You also run into issues like accidentally putting something into the logs that shouldn't be there, or logging the wrong information, and things like that.

Finally there's tracing, which I would say is still relatively uncommon in Rails applications. Tracing is where you capture events with some kind of causal ordering. If you have multiple instances, or multiple threads, or multiple processes all writing to the same stream, there's no guarantee that the order in which events actually happened in time is the order in which they get recorded in a log. If process A and process B both write to the same log stream, their lines overlap and commingle, and if you don't have a transaction-level understanding of what happened in the system, you can't reconstruct it just by reading the log afterwards. Tracing is different: it lets you attach identifiers to requests, or to transactions, or to whatever is of interest in your particular system, and use those as the basis for understanding not just one event, one thing that happened, but a series of events that together is meaningful at the business level. For example, in order to buy a concert ticket, maybe you first have to authenticate yourself, then place a hold on the credit card, and so on. The combination of all those events makes its way through the system and eventually winds up back at "hey, here's your concert ticket." You clicked one submit button, but five different microservices did a bunch of stuff, and if there's a problem in there, you're going to want to understand how it bubbled back up to the original request.

It turns out there's a great standardized API for this called OpenTracing, with implementations that fill in the blanks for specific runtimes; you can look up OpenTracing for Ruby and Rails applications to learn more. One option is OpenZipkin: Zipkin has agents for many different runtimes, and one of them works for Ruby on Rails applications, or anything else that uses Rack. That makes it a really easy way to add tracing middleware, so every request that comes in gets a trace identifier attached to it. There's admittedly less utility here for Rails applications that are essentially monoliths and don't talk to any other applications or services, because there's nothing to distribute the trace across.
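To show the basic mechanic of the Rack-middleware approach, here is a deliberately simplified, hypothetical middleware that assigns each request a trace identifier. It is not the actual Zipkin or OpenTracing API, just an illustration of where that kind of library hooks in.

require "securerandom"

# Hypothetical tracing middleware: reuse an incoming trace id if one is
# present, otherwise start a new one, and pass it along in the response.
class TraceMiddleware
  HEADER = "HTTP_X_TRACE_ID".freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    trace_id = env[HEADER] || SecureRandom.hex(8)
    env["trace.id"] = trace_id            # visible to the app for logging
    status, headers, body = @app.call(env)
    [status, headers.merge("X-Trace-Id" => trace_id), body]
  end
end

# In a Rails app this would be registered in the middleware stack, e.g.:
#   config.middleware.insert_before 0, TraceMiddleware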

In that case you just have one thing that requests go to, and all the processing happens inside that one system; Rails applications do tend, compared to Java applications, to have more of a center of gravity around deploying a single thing rather than having many separate services in the portfolio.

The third set of tools in the toolbox is the four golden signals. This is a phrase that comes from Google's SRE book, published a few years ago. The idea is essentially: if you're recording facts about your system and you don't have much room to choose what to record, what should you choose? And they offer basically four things you should prefer to measure.

First is traffic: how much demand is the system facing? What is the task it's being asked to do? Latency is how long requests have to wait: from the moment somebody hits a button in their browser, how long does it take before they get a response? It's important to distinguish between the latency of requests that succeed and the latency of requests that fail. For example, an HTTP 500 triggered by the loss of a connection to a database or some other backend might get served really quickly: "hey, this thing is down, here's a 500." You don't want that quick 500 to contribute to your overall understanding of the system's latency, because then, in order to optimize for latency, you should just shut down your website.

Saturation is how much more the system can handle based on what it's currently doing, in terms of the resources it consumes to process requests. Many people assume their applications are limited by CPU and really want to run on a beefy AWS machine, but most applications that are web services or websites are usually limited by network IO: the faster you can move traffic to and from your website, the faster you can process requests, and it usually doesn't take much CPU on the server side to render a page. So it's important to understand what the actual limiting factor is.

Finally, from the four-golden-signals perspective, we want errors: how frequently are requests failing? That can be explicit, for example an HTTP 500, or implicit, where you get an HTTP 200 but the wrong thing is returned, or the data that was supposed to go into the database never actually got written, et cetera. When the protocol's response isn't sufficient to express a failure condition, you're going to need some other way of expressing that there was a failure; you can't just look at an HTTP 200 and call it okay.
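Here is a very small, hypothetical sketch of tracking some of the golden signals for one service; the structure and the request shape are invented, and saturation is deliberately left out because it should track whatever resource actually limits the service.

# Traffic, errors, and latency, tracked naively for illustration.
Signals = Struct.new(:traffic, :errors, :latencies_ms)

signals = Signals.new(0, 0, [])

def record_request(signals, duration_ms:, success:)
  signals.traffic += 1                  # traffic: how much demand is arriving
  signals.errors  += 1 unless success   # errors: explicit or detected failures
  # Latency: only record successful requests here, so that fast failures
  # (like an instant 500) don't flatter the latency picture.
  signals.latencies_ms << duration_ms if success
end

record_request(signals, duration_ms: 120, success: true)
record_request(signals, duration_ms: 5,   success: false)   # counted as an error, not as latency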

Okay, so putting it all together from an observability perspective, here's how I think we should use each of these tools in the approach I'm going to describe. Metrics are for measuring the real work that's done by your system and the trends that occur over time. Traces are for understanding the specific causes of problems across services, when you cross some kind of boundary. Logs are for helping humans understand what's going on, and for providing a structured, machine-readable event stream.

So, getting started with observability: there are all kinds of different topologies you could encounter in a distributed system, and it doesn't really matter whether they're Kubernetes pods or different instances of your application on Heroku. Whatever the topology is, it gets represented on some really nice dashboard that abstracts away the fact that these things are messy, interconnected in strange ways, and so on. That pleasing facade is not what a distributed system really looks like, and it's too reductionist a view; we're back to the dinosaur again. These systems don't look like each other, so we can't use the same approach to measure all of them.

I think the way to get around that is to embrace the inherent complexity, the fact that these systems are different. We don't write the same unit tests for every application; why would we? We certainly don't write the same integration tests for every application; so why would we write, or rely on, the same metrics for every application? Developers need to embrace the inherent complexity of whatever their application is, measure it accordingly, and understand whether the system is meeting its SLOs. It doesn't matter whether the system is simple inside, complex inside, or somewhere in between. What matters is: what questions do we want to ask to understand how the system is doing? What do we want to know about it? What's our window into the printer? We should come up with that for each and every application in the portfolio: what would tell us the most about whether this system is healthy and meeting its SLOs?

To do that, I think we have to consider a different kind of metric, one made of four parts. The first part is demand. This is similar to the golden signals idea, but a little different: by demand I mean how much work the system is being asked to do. "Work" here is a nebulous notion, but it's specific to the system. If it processes concert tickets, work is probably requests for concert tickets; if it encodes video, it's probably the bits per second it's being asked to encode; and so on. That notion of work will be specific to whatever application, whatever microservice, whatever thing is doing the work.

The second part is how much of that demand is being satisfied. If this thing processes concert tickets and is being asked to sell ten concert tickets a second, are we selling ten? What's happening? The third is the efficiency with which we produce that output: how well does the system use its resources, memory, CPU, all the low-level things, to produce the output? You can think of the system as a machine that consumes memory, CPU, disk IO, et cetera, takes requests, in this case for concert tickets, and turns them into sold tickets; how good is it at turning those inputs into work? And finally, capacity: does the system have enough resources to do the work being demanded of it?

If we measure all four of these things, we'll have a very specific understanding of the system that's based on the work it's doing. The cost is that we have to understand the system; we have to know something about that specific application.
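Here is one hypothetical way to represent that four-part view in code; the class, the numbers, and the units are invented, and a real system would derive them from its own notion of work.

# Demand, satisfied demand, efficiency, and capacity for one system,
# expressed in that system's own units of work (here: ticket requests).
WorkReport = Struct.new(:demand, :satisfied, :resources_used, :resources_total) do
  def satisfaction_ratio
    demand.zero? ? 1.0 : satisfied.to_f / demand
  end

  def efficiency
    resources_used.zero? ? 0.0 : satisfied.to_f / resources_used
  end

  def headroom
    1.0 - (resources_used.to_f / resources_total)
  end
end

report = WorkReport.new(10_000, 9_950, 60, 100)   # requests per minute, abstract resource units
report.satisfaction_ratio   # => 0.995
report.headroom             # => 0.4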

We can't take this and apply exactly the same metric, unchanged, to 10,000 applications in a portfolio, but it has to be that way. You wouldn't say a system was correct if it returned HTTP 500s whenever it handled more than one concurrent request at a time; in the same way, we shouldn't say a system is healthy just because some generic measurement looks fine. For each application in the portfolio we have to ask: what is different about this one, what is this system actually doing, and what are its SLOs? In many cases that will be a new idea for some teams and some organizations.

So let's look at a few examples of promises broken down this way. One is a real example of a database running inside a Kubernetes cluster, handling about 1,200 queries. The demand is the number of queries it's asked to process, and the output is how many queries it actually processes. In another case, a concert ticket website gets about 10,000 requests per minute for concert tickets; the demand is the "I'd like you to reserve my tickets" requests, and the output is tickets that actually get sold, without errors, in 90 seconds or less. If you're running a patient medical records (EMR) system, the demand is requests for patient records, and the output is how many of those records you return in a timely fashion.

Now, it can be tempting to set thresholds on these things. You might say: I want to make sure I'm always servicing a certain level of concert tickets or patient record requests, so I'll set an alert that fires if we're not processing at least 10,000 requests per minute. But then you're guessing; you're being asked to predict the future about when one of these things becomes a problem, and I think that's the wrong way to go about it, because it's too arbitrary. It's just a guess that 90% memory, or 85% memory, is the right line, or that 5,000 requests per second is the right number. What you care about are the app's SLOs. If you do want to set a threshold on something, set it on a percentile rather than on a specific value.

Here's what I mean by that. Imagine you record the latency of a system over time: for every single request, you take the time at which it happened and the latency it experienced. A more helpful way to look at all of those requests is to order them not by their time index but by their value. Now you see a different picture of the system: most of the curve is flat and not very tall, but towards the top end there's a real change in its slope, and if you zoom in a little more you can see that the change occurs right around the 89th percentile. If you were going to set an alert, that's clearly where some different regime, some different kind of behavior, kicks in, so set it there, not at an arbitrary number like "more than a hundred." Using the raw data to tell the story is how we're going to get observability into the system.
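A small sketch of that percentile idea: sort the observed latencies by value and look at where the tail begins, instead of alerting on a hand-picked absolute number. The 89th percentile is just the example figure from the talk, and the data below is invented.

# Given observed latencies, find the value at a given percentile.
def percentile(values, pct)
  sorted = values.sort
  index  = ((pct / 100.0) * (sorted.length - 1)).round
  sorted[index]
end

latencies_ms = [12, 14, 15, 15, 16, 18, 20, 22, 25, 180, 310, 540]   # made-up data
percentile(latencies_ms, 89)   # => 310: roughly where the knee in this made-up curve begins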

Let me walk through a couple of scenarios to test how this idea works. Say that in one scenario we have a load spike: the number of queries we receive goes up by 20%, efficiency stays the same, the resources we're using go up by 20%, and the number of queries we're processing goes up by 20%. Raise your hand if this is a problematic situation, something you would send an alert about. Raise your hand if you wouldn't send an alert about this. Right: I wouldn't alert on this, because our requests went up by 20% and so did the amount of work we did, which is good, and our efficiency didn't change, which is also good. This is totally normal behavior for the system.

Scenario two: the queries we receive go up by 20%, the output goes up by 20%, but now we're using 60% more resources, so efficiency went down. Raise your hand if this is something to alert about. Raise your hand if you wouldn't worry about it. Right: it depends a lot on whether or not that brings you too close to the capacity the system needs to meet its SLO. If meeting your SLO requires headroom 50% above this level and you're already at 95% of capacity, you've got a big problem; if it's 50% above this level but you're only at 10% of your total capacity, no big deal.

Scenario three: the demand is going up over time. Say the number of queries we receive goes up by 80%, efficiency stays the same, the output goes up by 20%, and the resource usage goes up by 20%. Raise your hand if this is something you would alert on. Raise your hand if you wouldn't alert on this. I would alert on this, because the queries went up by 80% but the output only went up by 20%, so we're dropping a bunch of requests on the floor, or failing to meet our SLO in some other way.
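Sticking with hypothetical numbers, scenario three could be flagged roughly like this; the figures and the 20% tolerance are invented for illustration.

# Scenario 3: demand up 80%, output up only 20%.
before_demand, before_output = 10_000,  9_900
after_demand,  after_output  = 18_000, 11_900   # +80% demand, +20% output

before_ratio = before_output.to_f / before_demand   # => 0.99
after_ratio  = after_output.to_f  / after_demand    # => ~0.66

# The big drop in the fraction of demand being satisfied is what warrants an
# alert here, not any particular CPU or memory number.
alert_needed = (before_ratio - after_ratio) > 0.20   # invented tolerance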

It's not that other metrics aren't important; things like memory, disk IO, and CPU still matter. But it's important to optimize for the people who are trying to solve problems and understand the state of the system, the people looking through the window on the printer to figure out what they need to do next. Among the firms I mentioned earlier that applied this strategy, not just to their Rails application portfolios but to their broader application portfolios, they saw a 98% reduction in false positive rates for the applications using it. They saw a mean 15% improvement in on-call outcomes, meaning that when they get a PagerDuty alert, it's actually about something important, about an SLO being violated, not "whoops, we set an alert on a thing that doesn't really matter and now everyone's getting paged about it." And all three of them, maybe most tellingly, have expanded this approach to new systems, not just the existing ones where they first tried it. I think that's a really powerful statement about the difference it makes for real-world systems at real-world companies.

Observability is not a silver bullet. It's going to require work; you can't magically install some library and get observability. There's no gem install observability. Of course, now someone is going to make a gem called observability and ruin this joke, but there isn't one today. It requires more than just picking the right numbers or guessing the right values; it requires a holistic understanding of your system. You have to orient what you capture around the promises you make, around the SLOs your system is responsible for. If you don't make promises to anybody, then what's the point of your system? What is it doing? Either it's a hello-world application with no real substance or business value, or you don't understand what your system is actually doing, and both of those are bad places to be if you're a business. It's really hard to get this right, knowing what the SLOs are and how to measure them, but it's important to have a starting point, and I think the approach I just described is a good place to start. Iterate on that, and you'll get to, I think, a happier place in terms of the outcomes you want. Thanks very much, and enjoy lunch.
