Anomaly detection at scale for performance engineers ‐ Tuli Nivas

About speaker

Tuli Nivas
Principal Performance Engineer at Salesforce

Tuli Nivas is a principal performance engineer at Salesforce with extensive experience in design and implementation of test automation and monitoring frameworks. Her interests lie in software testing, cloud computing, big data analytics, systems engineering, and architecture. Tuli holds a PhD in computer science with a focus on building processes to set up robust and fault-tolerant performance engineering systems.


About the talk

Tuli Nivas presented on automated anomaly detection in production. She showed how simple statistics can shorten both the time it takes to identify an issue and the time it takes to resolve an outage. The session presented a concise, rational approach to anomaly detection, with the complex topic and its math clearly explained.

Transcript

Hello everyone. My name is Tuli. I am a performance engineer at Salesforce, just like Jonathan mentioned, and today I'm going to talk about anomaly detection at scale. I am going to try and wrap this talk up as fast as I can, so that there is time at the end to answer questions. I do get called on from time to time to look at production systems and to help identify and resolve issues before customers are impacted, and I hope this resonates with some of you here when I say that my biggest challenge has always been the data correlation part.

So if I am looking at a particular performance metric for a server, or a pod as it is called at Salesforce, how do I really make sure that I'm looking at anomalous behavior, and that it is not just the regular profile for that pod on a given day of the week, day of the month, or season, and so on? Let me elaborate a little bit on that by looking at a couple of examples. This is the CPU usage chart for one of our pods, and let's just assume that we found a really critical performance issue.

We identified it, we tried to resolve it, and then we introduced some new code into the system Wednesday night. Looking at the CPU usage for Wednesday versus Thursday, I do see a marked difference: there is a decrease in CPU usage. But now, how do I verify that the decrease in CPU is actually because of the code that was introduced into the system, and not just the normal Thursday profile for that server? The easiest way for me to identify and verify that is by looking at the load profile for that pod.

The load profile is nothing but the incoming transactions that are being processed on the server. And now, if I look at the Wednesday versus Thursday load profiles, they look very similar. So now I can look at the decrease in CPU and really confirm that yes, the code that was introduced into the system really made the difference. Now I can calculate the delta and the percentage improvement, and all is good. A simple, easy two-step process, basically very easy to do, for a single server.

But you all know that our production systems are made up of hundreds and thousands of these servers, so this is not a very scalable solution. Let's take a look at another example. Once again, this is CPU usage data for one of our pods in production: a lot of spikes, a lot of variation. Say for a few minutes we are just going to visually look at this data, and I am looking at the Tuesday, Wednesday, and Thursday data for the current week on this pod.

I just feel that, you know what, the CPU has increased. How do I verify that? I am going to look at the Tuesday, Wednesday, and Thursday data for the previous week. And yes, I do see some sort of increase in CPU usage. Once again, how do I verify that this is anomalous behavior and not just some change that happened on the server this week? So I compare the load profiles for the two weeks, and since we are just eyeballing the data, we are going to say the load profiles look very similar, which means the same transaction rates, the same

number of transactions, are being processed in the two weeks. So why has my CPU increased? At this point I could have ten different metrics that could potentially impact my CPU usage, and I have all of these time series charts, which basically means that I have to go back and forth for each one of these metrics, look at the week-over-week numbers, and compare them with my CPU usage, to try and find, metric by metric,

what in my system is really causing my CPU to increase. Not very difficult to do, once again, for maybe a few servers, but very soon we realize it's not a scalable solution, and that we want to automate this part, the anomaly detection process. So before we go into the main anomaly detection solution, I do want to give you an overview of the release and development lifecycle at Salesforce. It's a standard development lifecycle that I'd say 99% of organizations follow, but it'll give you an idea

of the decisions we made when we came up with the solution: what is my training dataset, what is my test dataset, and so on. So, of course, we have three major releases a year. And even though we have just three of these releases a year, changes get introduced into the system pretty much on a daily basis, because there are daily releases or emergency releases that go in, trying to fix either broken functionality or any kind of performance degradation that was noticed. So

whatever solution we come up with for anomaly detection, we wanted it to be able to run on a daily basis, and if possible we wanted it to run in real time, because any time a change in the system introduces a degradation, we really want to know as fast as possible, so that it can be resolved and any kind of customer impact can be shortened. The second thing that we want is that, with every release that gets introduced, the performance profile for the new release

does not vary very much from the previous release. We don't want to change SLAs, we don't want to change the contract with customers: they expect the same behavior from the system, even though we went and did a new release. New features that are being introduced, new functionality, it doesn't matter; we want the performance profile to remain the same. So keep in mind that the proposed solution we come up with needs to satisfy these two conditions: it has to run daily, if possible in real time,

and it should be able to compare performance release over release. And if you recall the two examples that I talked about in the beginning, my two biggest challenges were the manual data correlations I was doing every time I was trying to figure out and verify whether I was seeing anomalous behavior, by looking at the load profile for the pod. With those conditions and challenges in mind, all we did was go ahead and list out the metrics that

we would use in our anomaly detection system. To simplify things to the very basic: I am measuring server health, or pod health in our case, by looking at CPU usage; CPU usage is telling me the health of the server. And these are the critical metrics that say something about health, or CPU usage. These metrics were not chosen randomly; what we did was go back and look at the historical production data, look at all the incidents that occurred in the last year, and try to figure out where CPU degraded,

and then which metric was an indicator telling me the CPU was degrading on the system. And this is what we came up with. The load profile, the incoming transactions, is very important: anything and everything that happens on the server is actually triggered by incoming transactions, so we definitely want to keep track of it. We've got a couple of garbage collection metrics, which are actually indicators of JVM health, since our systems are Java-based. We've got a couple of IO metrics to look

at, because the IO metrics are an indicator of the IO operations that are happening on the pod and thereby directly impact CPU usage. Error messages indicate broken functionality, and response times will also give you an idea of performance bottlenecks in the system. So we want to keep track of these eight metrics. The first thing that we did was find out whether there was any kind of relationship between these eight metrics

and the incoming transactions on the pod, and this is what we came up with. This is just raw data: here on the x-axis are the incoming transactions, and on the y-axis are all the individual metrics we are trying to measure; we are trying to find a relationship between the incoming transactions and these individual metrics. Like I said, this is just raw data, we haven't cleaned it up or anything, so there are a couple of exceptions, but if you

look closely, there does seem to be some kind of relationship, and the relationship seems to be linear. That basically means we can use linear regression techniques to base our anomaly detection system on. So this is the well-known linear regression equation; I don't think I need to explain it in a lot of detail. The independent variable x is basically the incoming transactions, the dependent variable y is the individual metric that I'm trying to assess with respect to the incoming transactions, the regression parameters will be found from the data, and the error, or residual, component is something I'll explain in just a couple of minutes.
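
For readers following along, the well-known equation being referenced is the standard simple linear regression form, with x the incoming transactions, y the metric under assessment, and ε the residual:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
```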

All right, so now we are actually going to talk about the solution itself. It's a really simple three-step process. The first step is to fit a model to the normal data. What is here on this slide is basically just CPU utilization versus the incoming transactions to my pod, and the goal of linear regression is to fit this data with a line such that my error

is minimal. So I run my linear regression code, and this is the output for it; this is where I actually get my regression parameters from. The r-squared value, at least for this exercise, was 0.97, which basically means the regression line explains about 97% of the variation in the data, and that is good enough. So we've got our linear regression model. The next step is to find the error, or residual, value. How do we define a residual? Well, the error value is nothing but the difference between

the actual value and the predicted value. So if I wanted to visualize this definition for this data (once again, this data is transactions versus CPU usage), it basically means that I go ahead and fit this data with my linear regression line, I pick a data point, say my highest value here, and I calculate the distance between the point and the line, and that is actually my error value.
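
A minimal sketch of these first two steps, assuming NumPy and illustrative variable names (this is not the code shown in the talk):

```python
import numpy as np

def fit_baseline(transactions, metric_values):
    """Fit y = b0 + b1*x by least squares; return parameters, residuals, R^2."""
    x = np.asarray(transactions, dtype=float)
    y = np.asarray(metric_values, dtype=float)
    b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the fitted line
    residuals = y - (b0 + b1 * x)      # residual = actual value - predicted value
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot  # e.g. ~0.97 in the talk's example
    return (b0, b1), residuals, r_squared
```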

Now, what is missing from all of this is any sense of the distribution of these residuals. To elaborate on that: let's say the error value here was five. What does five mean, more or less? And remember, this is one metric; there are eight metrics we are trying to use in order to detect anomalies. How does an error value of five for one metric compare to an error value of five for another metric? To address this dilemma, instead of using raw residual values we are going to use the z-score. The z-score is nothing but a standardized value, where we measure the distance of a data point from its

mean in terms of standard deviations. Precisely because this is a standardized value, we will actually be able to say how far a data point is from its mean in terms of standard deviations, and this definition can be used for all the metrics. The third and last step in this solution is to set a threshold: is a data point anomalous or not? And to set that threshold, we use the 68-95-99.7 rule. It will be easier to explain this in

terms of a hypothesis test. So if I set this up as a hypothesis test, and I say my null hypothesis is that a data point is not an outlier, then with a really small p-value I will be able to reject the null hypothesis. To give you a couple of examples: if I have a z-score value of 1, I have a p-value of about 0.32, which basically means that I am going to categorize a data point incorrectly 32% of the time; but the corollary is also true, that 68% of the time

the data will not be considered an outlier. If I have a z-score value of 2, I have a p-value of about 0.05, which means that 5% of the time a data point will incorrectly be considered an outlier, but 95% of the time it will not. So with thresholds there is always a trade-off between precision and coverage, and at this point in time, when we are just starting off with our solution, we want to feed the algorithm as many data points, as many signals, as we can get.

So for all the examples that I'll be talking about next, we set our threshold to be a z-score value of one: any time I get a data point whose z-score value is greater than one, I am going to categorize it as an outlier, otherwise not. Three steps.
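
Continuing the same sketch, steps two and three standardize the residuals into z-scores and apply the talk's threshold of one; only the high side is flagged, since negative z-scores indicate improvement:

```python
import numpy as np

Z_THRESHOLD = 1.0  # the talk pairs z = 1 with p ~ 0.32 via the 68-95-99.7 rule

def flag_outliers(residuals):
    """Return z-scores and a boolean mask marking anomalous data points."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()  # distance from the mean in standard deviations
    return z, z > Z_THRESHOLD     # flag degradations only, not improvements
```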

Now let's put all this together and see the algorithm in action. This is the output of our anomaly detection solution, and the first chart is actually very similar to the one that we started off with, where we were trying to figure out whether there was any kind of relationship between a metric and the incoming traffic. Remember the two conditions that the solution had to satisfy. The first was that it had to run daily: the blue and the red dots that you see here are daily data, collected every ten minutes. The really big circles that you see, the big green and red circles, are the outputs of the algorithm, the measurements at the current volume at which the algorithm was run. Mind you, it's not really real time, it's not running every second; it is running every day.

That is near real time for us. The second condition the solution had to satisfy was to compare performance release over release. The previous release's data acts as the training data: the black diagonal lines that you see on these charts are my linear regression fit lines for the previous release, and the current dots that you see, the blue and the red ones, are my test data from the current release. So both the conditions that I started off with, and the two top challenges

that I had, are covered. Every time I was verifying anomalous behavior I needed the incoming traffic, and that information is included in this chart, because the x-axis is the incoming transactions. So I don't have to go back and forth looking at time series charts for what happened last week, what happened the previous month, what happened the previous release. The information is right there: the data correlation is automated, and the information is included in this one chart. The colors

mean that, you know, blue is not an anomaly, and red means an anomaly was actually detected. As a performance engineer, if I see something like this for a pod, where my GC times appear to be going up but my GC counts are actually down, I know from this chart that it is really anomalous behavior. I can now go back to the code and try to identify what is happening and resolve the issue, so that any kind of customer impact is shortened.

And at least in this example, there are a couple of error metrics that are also going up. So once again, I don't have to waste time going back and forth trying to figure out whether this is really anomalous behavior or not: I know it is, and I need to fix it as fast as I can. We can extend this same solution to more granular metrics. For example, in this chart we are looking at the different types of transactions that my pod is processing. The first column is the count, so the

number of times that particular transaction type was processed; the second column is the average processing time; and the third column is the cumulative processing time. The horizontal lines that you see, the pink, the yellow, and the blue lines, are actually the median, 25th, and 75th percentile values for the previous release; so once again, we are comparing release over release.
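
As an illustrative sketch (hypothetical data layout, not the talk's code), those reference bands could be computed from the previous release's measurements like this:

```python
import numpy as np

def percentile_bands(prev_release):
    """prev_release: dict mapping transaction type -> list of measured values.
    Returns the 25th percentile, median, and 75th percentile per type,
    i.e. the three horizontal reference lines drawn on the charts."""
    return {txn: np.percentile(vals, [25, 50, 75])
            for txn, vals in prev_release.items()}
```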

For example, in the very first row I can see that this particular transaction type is being processed more in the current release versus the last release: the counts are up. Currently, when I took the measurement (the big green circle), it seems to have gone down, but it is something I need to keep an eye on, how these transactions are being processed. The interesting observation here is actually this last row: you do not see any of those horizontal lines. What does that really mean? It means that this is a brand-new transaction type that was introduced in

this release, and was not even present in the previous release. We've all got these huge distributed systems; like I said, changes come in every day and releases bring in major changes, so it is very hard to keep track of every change that gets introduced in a major release. This is a good way to keep track: now I know that this is a brand-new feature, or some feature that was enabled in this particular release, something to keep track of. In a similar way, we can

actually go one level deeper, and for a particular transaction type I can look at the performance metrics for that transaction type. For the transaction type that I'm looking at here, I know instantly that my DB get-connection times are actually going up. Once again, this is a release-over-release comparison, a daily, almost real-time evaluation of the metrics, so I know to start trying to figure out why these connection times are high and resolve it, so I can

lessen any kind of bad performance impact or customer impact. So far we've talked about this really simple anomaly detection solution, and we've also seen how I can extend the solution to the different metrics that I'm measuring. What we haven't talked about is how I scale the solution in terms of the number of pods and servers that we have in production. This is a table that we came up with. The table is actually trying to rank, or prioritize, servers based on how healthy they are.

Two approaches have been combined, and the benefit of a summary table like this is: you've got these hundreds and thousands of pods in production, and you've got plenty of them whose status is blinking red; which server should you look at first? That is what this table is trying to answer. The first approach is basically just to count the number of outliers. As a recap: we are measuring server health using those eight metrics that I talked about, and the idea is

that at the time I'm evaluating, the anomalies we detected are counted; that's column number 3 here, the outliers. The second approach is to use the z-score values: what I'm doing here to get column number 4 is averaging all the non-negative z-score values. Remember, my threshold to identify outliers was a z-score value of one; negative z-score values basically mean that some kind of improvement happened for that metric, but anything one and higher means it was an outlier. So if I went ahead and added

up all the z-score values, including the negative ones, a server might appear more healthy than it really is. So we pick just the non-negative z-score values, and that is column number 4 here. Now, the reason for combining these two approaches: you may end up with a pod with more anomalies ranked lower than a pod with fewer anomalies, because the z-score value is telling you about the magnitude, the severity. So that is how we scale this out to the production systems.
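
A sketch of how the two approaches could be combined, with hypothetical names since the talk shows only the resulting table: count the outliers per pod, then use the average non-negative z-score as a severity tiebreaker:

```python
def rank_pods(pod_zscores, threshold=1.0):
    """pod_zscores: dict mapping pod name -> list of z-scores across its metrics."""
    rows = []
    for pod, zs in pod_zscores.items():
        outliers = sum(1 for z in zs if z > threshold)               # column 3
        non_neg = [z for z in zs if z >= 0]                          # ignore improvements
        severity = sum(non_neg) / len(non_neg) if non_neg else 0.0   # column 4
        rows.append((pod, outliers, severity))
    # Least healthy first: most outliers, then highest average severity.
    return sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)
```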

Now I just want to take a couple of minutes and talk about another technique that was really beneficial to us when we tried to scale our anomaly detection solution: clustering. You may have noticed, and this is definitely the case for us, that even though we have the same release getting deployed on all production servers, the same piece of code running on these servers behaves differently. There could be a multitude of reasons: it could be that the kind of customer running on the server is different, it could be because the hardware is different, it could be because of the state of my app servers,

it could be because the workload profile is different. Whatever the reason, these servers will behave differently, and what we want to do is apply clustering and classification algorithms and group the servers into little clusters. The benefit of doing that is that now, instead of keeping an eye on, I don't know, a hundred servers, I can actually just pick out a representative server from each of these clusters. For example, in this case we've got five clusters; all I need to do is look at those

five representative servers from the clusters, and any time I notice any kind of performance bottleneck, or any kind of change in the system that could lead to a degradation in performance, I will know exactly which servers will be impacted the same way, and that basically tells me what the customer experience is going to be for that cluster. And, you know, if you had a really big customer running on a pod, and the representative pod from that cluster is showing a performance bottleneck, I

know it actually drives a sense of urgency to fix that problem, because you've got this really big customer sitting on that cluster and you don't want them to be impacted in any way. So what we did was apply clustering algorithms to the different kinds of hardware that we run in our production systems; we went ahead and clustered pods based on the different workload profiles, that is, all the different types of transactions that get processed;

and we looked at the peak transaction volumes running on the pods, and how many app servers and hosts make up a pod. We combined all of that, and we got something like this: all production pods grouped into clusters. Now I just need to monitor a representative pod from each of these clusters, and I will be able to tell which other pods and servers will get impacted if I find any kind of performance issue on just one of those pods.
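
As a sketch of that clustering step, assuming scikit-learn (the talk does not name a specific algorithm, so k-means is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def representative_pods(pod_names, features, n_clusters=5):
    """features: one row per pod, e.g. encoded hardware type, workload-profile
    attributes, peak transaction volume, and host count."""
    X = StandardScaler().fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    reps = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The pod nearest its cluster centre stands in for the whole cluster.
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps[c] = pod_names[int(members[np.argmin(dists)])]
    return reps
```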

All right, so with that we've actually come to the end of the presentation, and I'm happy to answer questions if there are any.
