In my current role I am responsible for leading Product Management for the Broadcom Mainframe Security and Intelligent Operations portfolios. This involves developing a strategy that ensures each product maintains a strategic competitive advantage and meets the requirements of our customers. Prior to this role, I was Director of Product Management at Hewlett Packard Enterprise in the IT Operations Management business and previously the Enterprise Security Products business. I also spent over 20 years at IBM in many roles including Strategy, M&A, Product Management, Marketing, and both Hardware Engineering and Software Engineering. My areas of expertise include Product Management, Strategy and Marketing. I have a Bachelor's degree in Electronic Systems and Microcomputer Engineering and an MBA.
About the talk
Stuart McIrvine delivered a great presentation on where AI can and should be implemented and where the biggest impacts on IT operations can occur. He does not envision a fully autonomous operation, but instead talked about the application of AI for predictive analytics and automating certain tasks. Stuart’s presentation had a lot of key takeaways.
I've been around for a long time, a long time. Tired, I really am. I've been in the industry a long time. I've been with Broadcom, with CA, about 3 years now. Before that I was with HP, when it kept splitting, splitting, splitting, and I ended up being one of the little items that got split during that whole process. I was there for six years. And then before that, the largest part of my career was with IBM. I was with IBM 21 years, so I was a little baby when I started with IBM, and I did all sorts of jobs over there. I say I'm tired, but I really enjoy it. I've loved information technology and it continues to be a fascinating area. My responsibilities extend into security, so I'm focused on mainframe security right now, where I have the product management for our mainframe security products, as well as the whole intelligent operations space. That is what I'm going to focus on today.
But the first thing I want to have a little blurb on is Broadcom. Who had heard of CA Technologies? Lovely, thank you. And Broadcom, beforehand? Okay, that's pretty good. Broadcom is semiconductors, a lot of devices, certainly very big in the data center, which makes it very relevant. They see the growth of the data center around the devices that they build and sell, but they also see the same type of value in software, so they wanted to get into software. They saw what we were doing with mainframe. (I keep saying "we" and "they"; it's all a big "we" now, Broadcom and CA.) They saw the value of the software, but in particular they had a clear interest in the mainframe, and that's a space I've been around for a very long time. That is really what triggered the investment here.

It has been great. And I'm not just saying that. You probably think, he works for the company, he's got to say that. But it has been tremendous, as we've gone through the early stages of this acquisition, to see the level of investment they want to put into mainframe. That's good for us. It's good for me as a product manager, because I get to spend more money on my products. But it's also tremendous for our customers. So, so good so far. Come back in a year and I'll tell you how it's going then.

So, this whole area of intelligent operations. Let me start off with a little bit of the whole problem space, what we're dealing with. This comes from various reports and surveys, but the volume of data that we have to get through to understand the operational issues we're dealing with is enormous, and it is growing. It depends which report you look at, but it's growing at 88% a year. And across some of these reports and surveys, the average mean time to recovery is about four and a half hours. These are typical large organizations, but that's a problem. I have one client, I can't really mention their name, but I was talking to them fairly recently and they were telling me that on their mainframe an hour of downtime equals six million dollars. So four and a half hours mean time to recover is a very, very expensive proposition. We've got to get that time much, much shorter. And some of these outages are big; I think 250 million is what we're talking about in one case. Delta had a big outage that cost them a hundred and fifty million. Even Amazon: an Amazon Prime outage cost them a hundred million dollars.
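To make the arithmetic behind that client example concrete, here is a minimal sketch. The function name `outage_cost` is made up for illustration; the two figures are the ones quoted in the talk (six million dollars per hour, four and a half hours mean time to recovery), and the linear cost model is an assumption.

```python
def outage_cost(cost_per_hour: float, mttr_hours: float) -> float:
    """Rough cost of one outage: hourly downtime cost times hours to recover.

    Assumes cost scales linearly with downtime, which is a simplification.
    """
    return cost_per_hour * mttr_hours


# Figures quoted in the talk: $6M per hour of downtime, 4.5 hours MTTR.
cost = outage_cost(6_000_000, 4.5)
print(f"${cost:,.0f}")  # → $27,000,000
```

On those numbers a single average outage would be on the order of 27 million dollars, which is why shortening recovery time is the focus.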
The point I'm trying to get at is that this is difficult and it's expensive. So why is it difficult? Well, you probably all know. This is what typical large organizations look like today: a highly complex environment. We're dealing with hybrid IT. Note that hybrid cloud is really a combination of public and private cloud; hybrid IT is everything. It's mobile, it's on-prem, it's private clouds, it's using public clouds, all the different types of systems that we use. Just look at it there. We've got web apps connecting in, touching structured and unstructured data. We've got physical servers, we've got virtual servers, et cetera. And all these millions of pesky mobile devices coming in, generating all this network traffic, just like the network traffic you'll be generating right now as you're tweeting about my session. Nothing bad, okay? These mobile devices drive all this network traffic, but there are so many different domains being touched here, and it's a highly complex environment, way beyond a human's ability to process all of this and truly understand it. This is why you see these MTTR problems, these long MTTR time frames.

So that's the whole complex hybrid IT environment, and there are other aspects to it as well. We've got things like DevOps and agile. We're trying to put out new apps, in an agile fashion, very frequently, all the time. I think Amazon pushes multiple app updates per day. I can't get my head around that one. But look at the velocity of the data center: the underlying architectures have to continue to change, and all these domains continue to change. As I said, there's the volume of events, the different types of data we're having to deal with, and having to bring in experts all the time to analyze this. Bringing in experts: first, their time is expensive. And this is not their day job. Analyzing potential outages or doing root cause analysis, we're taking them away from their day job to do this.

With all this complexity, the velocity of change, the volume of data and everything else, how do we do this today? The way we're handling this today is that we have our experts, and we have our monitoring tools that sit inside their specific domains. You can see here we've got network monitoring tools, storage monitoring tools, systems monitoring tools, et cetera. That's all well and good. But the problem is that in order to find out where a lot of the problems are, or to understand that we even have a problem in the first place, we have to understand perspectives across multiple domains, not just within those domains.

An example: even within the network domain, you see I've listed fault and performance there. I've got people looking at fault analysis and people looking at network performance. To pick a simple situation, I have a network fault: a router has gone down, for example, and another router picks up its traffic. The performance guys are going, "I'm starting to see poor response times, all the traffic's going through one router," and they start to point at that router, when there is nothing wrong with it; it's handling the traffic from the other one that broke down. So it's about having to understand the big picture, and that's just within one domain.

Now start to build up and look across other domains. I've got the applications and their transactions. These transactions are touching all the applications and storage and databases and everything else. So I start to see application response times degrade. On the experience analytics side, it's very important that we are looking at what response times end users are experiencing, because we've all become extremely impatient. I'm as impatient as our kids, and they grew up with these little devices that give them everything instantly. Response times are critical, and this is the whole area of experience analytics: trying to understand what is causing that response time. Is it the database? Is it the network? Is it just pure application performance? So that is another area we've got.

Quite often when we hit these really tough problems, yes, we understand what's going on in the network; yes, we understand what's going on from an infrastructure and user experience perspective; and we still can't find the problem. So what do we do? We start a bridge call. We bring everyone together and we try to figure it out, all the experts talking to each other. Again, just like I said before, we're wasting a lot of their time, taking them away from their day jobs to try to figure out exactly what's happening. This is what we really need to get away from, so that these experts can do their jobs.

So where are we moving to? We're moving to what has been coined, I should say, AIOps. It's bringing machine learning and artificial intelligence together across the data from all these different domains: having the ability to correlate the data across these domains and gain insight, using machine learning and artificial intelligence to truly understand the root cause of the problem. Not what we're doing today, where I can see a whole bunch of symptoms, but trying to understand where the root cause is actually happening. I'm going to go into some more detail on how we actually go about that, because a lot of it is a process.

This is also something Gartner is tracking. Gartner is dealing with frequent client requests in this area; I think they received well over five or six hundred client inquiries in this specific area. And, as Gartner does very well, they're making predictions: that by 2022, 40% of organizations will be using artificial intelligence in this particular space.
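As a rough illustration of what "correlating data across domains" could look like at its simplest, here is a sketch that groups events close together in time and keeps only the groups that span more than one domain. This is not the product's algorithm; the `Event` type, the 60-second window and the sample events are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: float      # timestamp in seconds (hypothetical field)
    domain: str    # e.g. "network", "storage", "application"
    message: str


def correlate(events, window=60.0):
    """Group events so that each event is within `window` seconds of the previous one."""
    groups, current = [], []
    for ev in sorted(events, key=lambda e: e.ts):
        if current and ev.ts - current[-1].ts > window:
            groups.append(current)
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return groups


def cross_domain(groups):
    """Keep only groups that touch more than one domain."""
    return [g for g in groups if len({e.domain for e in g}) > 1]


events = [
    Event(0, "network", "router r1 down"),
    Event(20, "storage", "latency high on volume v7"),
    Event(45, "application", "checkout response time degraded"),
    Event(600, "network", "link flap on port 12"),
]
candidates = cross_domain(correlate(events))
```

Here the first three events land in one candidate incident because they are close in time and span three domains, while the isolated link flap is left on its own. A real system would correlate on much more than timestamps.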
Other stats: this comes from the Sloan business school, and I think there's an Adobe document as well, but they highlight that most organizations are investing in building and launching big data and artificial intelligence initiatives, and nearly half of large organizations today actually have a very well-defined strategy in this area.

Now, what is the basis of intelligent operations? What is needed in this particular area? As I've been saying, what I've been talking a lot about is all these different domains. We've got experts in each one of these domains, whether it's systems, network, application, et cetera, and we end up dealing with bridge calls. These domains produce incredible amounts of very useful data. They also produce incredible amounts of useless, irrelevant data. Let's call that noise. So the whole idea is to bring a lot of this data in. The first stage is to correlate: start to normalize the data that's coming in from the different places, the different types of events and everything else, and try to get it into a meaningful set that we can analyze. That's the ability to connect and correlate all these different events.

The other piece is to start deriving insights. We've pulled all these events in; now I've got to start filtering out some of the noise. I've got to start identifying (and I'm going to go into more detail on each of these areas) which ones are relevant and which ones are just noise, filter out the rubbish, let's say, and start to derive insights from the data that I'm gathering.

Then I can apply context. What does that mean? There are many different contexts here. Some of it could be: I'm going to look at this particular domain against what's going on in the other domains, not looking at the domain in isolation. That's a big part of the context. Other parts of the context could be environmental conditions: I just deployed a new app and things went nuts, and in the past, sometimes when I deployed a new app, things went nuts. Another context could be the time of year. I'm a retailer and it's Black Friday, or I'm a retailer and it's the holiday season. So context is critical here, because on some days of the year, or based on some events that have happened, I should expect to see some spikes, and some things can be normal. Sometimes it could be abnormal: say I deploy an app and it's a bad app, but I've seen this before. It's about gathering a lot of the historical data and understanding that when A happened before, followed by B, then C happened and everything went into the toilet, sorry, into the trash can. So when I start to see that A has happened and B has happened, C is probably going to happen, and I need to start highlighting some alerts and identifying where some of the problems are. So context is very critical: being able to bring all the alerts together, get some insight from that, and get even deeper insights by applying the context.

And then of course the last area, as we get much more advanced, is starting to help remediate some of the problems, to automate some of the action. This can be relatively simple at first. Nobody wants to have a completely self-driving system today, because we don't trust it. In certain basic situations we may want to automate some action, because we know that when this problem happens it's definitely this kind of situation, and this is the typical script or whatever that we run in order to remediate it. And as our information, our data and our insights get better and better over time, we get a little bit more confidence about the automated action that we actually take.

What I want to do is go into a little bit more detail on some of these areas and how we do it. But before I do that: I like this whole idea of the self-driving car, because it's actually a good analogy. There's a maturity model for how we're evolving the car from being the manual thing I drive today (I drive a pickup truck, which is not exactly self-driving) all the way through to a truly self-driving car. This basically comes from the Society of Automotive Engineers, so this is their definition.

At level zero, the bottom of the scale, you've got my pickup truck: pretty manual, probably a manual transmission. Then we get on to level 1, where you've got things like blind spot warnings. I've got to say, blind spot warnings are one thing I actually do not like, and for one reason. My wife's car has it, and when I get in there I get too used to having it, and then I get back into my pickup truck, which doesn't have it, and suddenly I'm not the best driver any more. But it's things like that, or an indication of when you should shift gears. The way I look at it, the first stage is manual, and in the next stage I'm getting notifications to help me be a better driver.

Then we start to move to level 2, which is partial automation. We've got adaptive cruise control, so I'm driving along and it detects that the car in front of me is slowing down a little bit. Emergency braking, for when you just have to read that text message or write that text message. I'm kidding. Seriously, that was a joke. Then we go on to level 3. This is where we start to get conditional autonomy. Now the car is starting to park itself and give me a lot of assistance on the highway, adjusting speed when I go around bends and things like that.

Level 4 is where, for predefined routes, the car can drive there. Has anyone seen any of these autonomous cars? They still have drivers in them and all the controls, so the driver can take over, but for certain point-to-point journeys they're pretty good at getting there. I don't know if I'm ready to get into one yet. Has anyone here got a Tesla? Oh, good. I'm not jealous of any of you. One of my colleagues at work, who is a friend as well as a colleague, got one last year and he took me out in this thing. Usually I'm not impressed with technology; we live and breathe technology every day of our lives. But that's an impressive car. He was driving along and he put his turn signal on, and that Tesla waited for a gap in the traffic and then just moved itself out. He wasn't even touching the wheel. I believe that is fairly far along in the maturity curve.

Then you go to level 5, and that's truly autonomous driving. When you talk to the big car manufacturers, we're still at least 10 years away from that. That's where you don't even have controls and you can't take over. The car is doing everything itself. You put in your GPS location and off you go.

So that's the car, and I think it's a good model, because underneath that I'm looking at where we are with IT operations today, how we're managing it, and where we really want to get to in the future: eventually, this very autonomous data center. Level zero, what that looks like, is swivel-chair management: different screens for network management and all that kind of stuff, still correlating manually across the domains. When we get to level one, that's when we start to be able to detect anomalies. We can understand through intelligent operations when things are going wrong or starting to go wrong. It's basically saying, something's going to go wrong here; we're starting to see some anomalies. Level 2 is where not only do I understand something's wrong, but I can help you find out what the cause of the problem is. It's not just telling you about a lot of different symptoms that are appearing.

Level 3 is starting to get to what we call learned remediation, the system learning from the experts too, and I'll talk a little bit about that. But it's also starting to kick off some remediation. So it's not just telling you something is going to go wrong, and it's doing more than root cause analysis: in some simple cases it's actually trying to remediate some of the simpler problems. Level 4 is really the self-healing operations. It's starting to say: I found the root cause for you and I've done some remediation for you, or I told you what the root cause was and you've done some remediation. I'm going to learn from that, because I'm going to look at the logs, I'm going to look at what you did to remediate this problem, and I'm going to look at the outcome. Did it fix it? Did my performance start going back to normal? Did my response times improve? So the system is going to learn from that. I'm taking the outcomes now; if it didn't work so well, that's not going to be something I go by. It's using the outcomes to understand: am I going in the right direction, and should I use that as a recommended remediation the next time? Level 5: we're a long way off from level 5, but it's really lights out. Nobody is still running data centers, and we're all out of jobs or retired at that point. So that's the maturity model for this, set against what we have with the self-driving car.

Level zero, which I talked about, is where we're very manual. This is looking at the sea of red, all these alerts coming up. Most of them are noise. I think IDG did a report that said 31% of all the events we're looking at are just pure noise, absolutely irrelevant.
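For readers who want the operations maturity levels in one place, they could be written down as a small lookup like the sketch below. The level numbers and what each level adds are taken from the talk; the enum member names and the helper `capabilities_up_to` are labels invented here, not official terminology.

```python
from enum import IntEnum


class OpsMaturity(IntEnum):
    # Member names are illustrative labels, not terms used by the speaker's product.
    MANUAL = 0
    ANOMALY_DETECTION = 1
    ROOT_CAUSE = 2
    LEARNED_REMEDIATION = 3
    SELF_HEALING = 4
    LIGHTS_OUT = 5


ADDS = {
    OpsMaturity.MANUAL: "swivel-chair management, manual correlation across domains",
    OpsMaturity.ANOMALY_DETECTION: "detects when things are starting to go wrong",
    OpsMaturity.ROOT_CAUSE: "points at the cause, not just the symptoms",
    OpsMaturity.LEARNED_REMEDIATION: "learns from experts and remediates simple cases",
    OpsMaturity.SELF_HEALING: "learns from the outcomes of past remediations",
    OpsMaturity.LIGHTS_OUT: "fully autonomous data center",
}


def capabilities_up_to(level: OpsMaturity) -> list[str]:
    """Everything an organization at `level` has, counting the lower levels too."""
    return [ADDS[l] for l in OpsMaturity if l <= level]


for line in capabilities_up_to(OpsMaturity.ROOT_CAUSE):
    print(line)
```

Treating the levels as cumulative is an interpretation of the talk: each stage is described as building on the one before.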
And the problem with that is it affects productivity. We're looking at noise, and not only is it affecting productivity, because you've got to figure out whether this is just noise or whether it's real, but we're also getting desensitized. We know this is true. Everyone has seen these noisy events, and it's taking our eye away from some of the real problems, or when you do see the real problems, you say, this is probably just more noise. All of this is affecting our mean time to recovery.

Now we start to get up to level one. What we're doing in this space is applying some algorithms. We've got a couple of algorithms, but the key here is to start to collect historical information to build a profile of what your data center looks like and what system performance looks like, and to really understand this profile, because this is what we want to start measuring against to understand when we have anomalies against standard behavior. So this is a standard behavior profile, let's say.

There are a couple of algorithms in here. One of them is what we call an exponentially weighted moving average. That one is critical because you don't necessarily want to wait six to eight weeks, or half a year, to collect enough information to build this profile. So there's some out-of-the-box capability you need, and this one is really looking at volatility, a kind of exponential volatility. If you see some exponential changes based on that volatility, you're probably starting to have some kind of problem. You don't necessarily need all this historic data to get started here, and it gets you away from these horrible static thresholds. Look at a static threshold set today: once I start to go beyond a certain response time, everyone runs around trying to figure out what the problem is. But as I say, it could be Black Friday, and then this could be normal for several hours. So that's one part of it.

There's another set of algorithms we use, which complements this, and it's very much looking at historical data. This is where you're building out this profile. You're adding in things like Black Friday, the holiday season, certain times of year where you get big orders coming in and large sales volumes. It's building up your profile and starting to understand what normal behavior is against that profile. Again, it gets you away from static threshold setting towards more dynamic thresholds, so that we are not dealing with noise or unnecessary events all the time.

What we use here is the Western Electric rules. This chart quite simply shows that I'm horribly colorblind; I should have changed the colors in this. Is anyone in here colorblind? You are as well. Can you see the different shades of green there? You can't. Well, you're a lot better than I am. I'm just horribly colorblind. But there is a dark green band in the middle. This is quite a good screen; I can see it reasonably well here. What we're doing with these Western Electric rules, if you're not familiar with them: there's a black line running through the chart, and that's the mean behavior of your system. This is where your performance goes across the timeline, at different times of the year, et cetera. The dark green band is closest to that. That is the one where, if all my performance points are within that band, I'm probably pretty good. That's the margin of error I'm prepared to deal with, the adaptive thresholding. The next, lighter green band is a bit wider. What happens if two out of three of my points are in there? I'm probably going to flag an alert, because two out of three are starting to go outside that dark green area and that's probably a problem. If I have two out of three in the next area, then I'm going to flag an alarm. And if even one leaks into the white space, I'm certainly going to flag alerts. So this is basically how we're highlighting these anomalies. It's a combination of understanding the history of what's going on for your organization and then looking at how far away from the mean I am before I start flagging alerts. That's one key part of what we're doing there.

Another thing that is critically important is reducing this noise, because you have so many events. I quoted that figure from IDG: 31% of these events are irrelevant. Another part of what the algorithms are doing here is noise reduction. When you get a whole bunch of duplicate events, it's taking them away. It's the same thing reported again, so you're using text analysis and so on to remove duplicate events, or symptoms of the same problem. There's a lot of noise reduction there, and noise is one of the things that is critically affecting the productivity of all of us. So that's a big thing here.

Something else going on here is, again, starting to look across domains. It's not just seeing a whole bunch of duplicate events. I've got network response time problems, storage problems and application problems all happening at the same time. I can probably cluster them together and highlight that it's probably the same issue. So it's enriching some of these events into incidents. This is a big part of reducing the noise: I can not just get rid of all the unwanted events, but I can start to look at where I see real problems in multiple domains that are highly related, highly correlated, and get you to focus on them first. As I said, that's a big part of the noise reduction as well.

Then we start to get into root cause analysis. What we're doing here is using machine learning. Let me back up a little bit.
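The two ideas from the anomaly detection part, an exponentially weighted profile and zone rules around the mean, can be sketched together as follows. This is an illustration, not the product's implementation. The rule set is the commonly documented form of the Western Electric rules (one point beyond 3 sigma; two of three beyond 2 sigma; four of five beyond 1 sigma; eight in a row on one side), which is more specific than the loose description in the talk. The function names, the smoothing factor `alpha=0.2` and the sample data are assumptions.

```python
def ewma_zscores(values, alpha=0.2):
    """Score each point against an exponentially weighted mean and variance
    built only from the points before it, so no long history is needed."""
    mean, var, z = values[0], 0.0, []
    for x in values:
        sigma = var ** 0.5
        z.append((x - mean) / sigma if sigma > 0 else 0.0)
        diff = x - mean
        mean += alpha * diff                              # update running mean
        var = (1 - alpha) * (var + alpha * diff * diff)   # update running variance
    return z


def western_electric(z):
    """Return (index, rule) for each point whose recent z-scores break a zone rule."""
    hits = []
    for i in range(len(z)):
        rule = None
        for sign in (1, -1):  # check above the mean, then below it
            side = [sign * v for v in z[max(0, i - 7): i + 1]]
            if side[-1] > 3:
                rule = "1 beyond 3 sigma"
            elif len(side) >= 3 and sum(v > 2 for v in side[-3:]) >= 2:
                rule = "2 of 3 beyond 2 sigma"
            elif len(side) >= 5 and sum(v > 1 for v in side[-5:]) >= 4:
                rule = "4 of 5 beyond 1 sigma"
            elif len(side) == 8 and all(v > 0 for v in side):
                rule = "8 in a row on one side"
            if rule:
                break
        if rule:
            hits.append((i, rule))
    return hits


# Hypothetical response times (ms): steady, then a spike.
response_ms = [100, 102, 98, 101, 99, 100, 180]
scores = ewma_zscores(response_ms)
alerts = western_electric(scores)
```

One caveat with this sketch: the variance estimate is unreliable for the first few points, so early scores can be exaggerated. A real deployment would likely add a warm-up period or seed the profile from historical data, which is the role the talk gives to the second, history-based set of algorithms.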
So one of the problems that I'm ancient way back to the beginning is our environments are constantly changing, you know, if your agile you deplane you Absol the time architectures or changing and everything else. He's a very Dynamic environment. I'm so therefore, you know, you bring out a new iPad it accesses, you know a database, right? You know, it integrates with other applications in the system at 8, you know, how tall is part of your system the dependencies are constantly changing so I can be part of what we're doing here is to use machine learning to start to
understand the fantasies this particular application. You know, what does it do? You know, what does it access certain ports in the network does its touch databases and Other Stories repositories at cetera? So dynamically Understand the topology and build you a topology map, right so that once we start to see certain problems occurring in certain areas, you know, they go read right now. I can understand. Well this app to going rate and I see is depend on this database which is also going red excetera. So this is no starting. Is it doing this Dynamic
dependency mapping, right, and building you this kind of topology picture that you can then use to drill down, et cetera, to understand where the root cause of a lot of these problems is.
Taking it a stage further: learned remediation. What does that mean? I remember, and I think I can say this without getting in any trouble, that my previous employer, if you remember who that is, had a very good product when they were in the software business. It was a security information and event management platform, so it had a lot of intelligence for understanding when you truly had a problem. They had another product that would try to remediate that problem. If I got a denial-of-service attack, it would go shut down the ports and the firewall and all these different things, right? The main product that did all the analysis had a massive customer base. Many customers bought the other product to handle the remediation. Nobody ever used it.
Now, I'm going back eight or nine years. This is one of the areas where we all lacked the confidence for a lot of this automated remediation, because we lacked a lot of the intelligence and the insights we needed to get to the true root cause of the problem. When we get to this remediation stage, we've got to be confident in all the upfront piece that I've just been talking about: correlation, insights, root cause analysis. As we're starting to develop a good foundation in that area, this piece is starting to become much more realistic, albeit today you may only do some simple remediations that are light and very non-destructive. But this is important to help improve this area, which is why I bring up this aspect of learned remediation.
So what we're doing here is, as we get the insights, the system is bringing forward: here's where the problem is, here's what the problem is, this is the root cause, and here's the recommended action. Now, I'm seeing this recommended action because the system is looking at historical data, at when this type of situation happened in the past, and then looking at logs to see what action was taken to try and remediate it. And so then it's making recommendations: do you want to do this now? It's making recommendations to the experts. The experts, and we use a kind of sentiment analysis here, give a thumbs-up or a thumbs-down, and the machine starts to learn.
A massive amount of expertise is getting ready to retire, right? The more of this tribal knowledge we can capture, the better. As we bring the experts in ("I'm going to remediate this problem; this is what I think the problem is, or I think it could be one of these three problems"), getting the experts to give a thumbs-up or a thumbs-down means the system truly can learn and gather some of this tribal knowledge. And rather than you having to document it, it can even start to build the documents: this is what happened, this was the action that was taken, and put that into docs, et cetera. So learned remediation is a very critical area. It's bringing the intelligence into what we're doing, but using your experts as well to help the system learn, right?
And then this piece, which I briefly mentioned earlier, is that once we've executed some of these remediations, we learn from them. What happened? Did it solve the problem?
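The recommendation loop described above (mine past incidents for the actions that were taken, propose them, and let experts vote) can be sketched roughly as follows. This is only an illustration, not how any particular product works: the `RemediationAdvisor` class, the action names, the smoothed success rate, and the 0.1 weight per expert vote are all assumptions made for the example.

```python
class RemediationAdvisor:
    """Hypothetical sketch: rank remediation actions from history plus expert votes."""

    def __init__(self):
        # (problem_type, action) -> [times it solved the problem, times it was tried]
        self.history = {}
        # (problem_type, action) -> thumbs-up count minus thumbs-down count
        self.votes = {}

    def record_outcome(self, problem_type, action, solved):
        """Store what happened after an action was executed (mined from logs)."""
        stats = self.history.setdefault((problem_type, action), [0, 0])
        if solved:
            stats[0] += 1
        stats[1] += 1

    def record_vote(self, problem_type, action, thumbs_up):
        """Store an expert's thumbs-up or thumbs-down on a recommendation."""
        key = (problem_type, action)
        self.votes[key] = self.votes.get(key, 0) + (1 if thumbs_up else -1)

    def score(self, problem_type, action):
        key = (problem_type, action)
        successes, attempts = self.history.get(key, (0, 0))
        # Smoothed success rate, so one lucky attempt does not dominate.
        success_rate = (successes + 1) / (attempts + 2)
        # Assumed weighting: each net expert vote shifts the score by 0.1.
        return success_rate + 0.1 * self.votes.get(key, 0)

    def recommend(self, problem_type):
        """Return the known actions for this problem type, best first."""
        keys = set(self.history) | set(self.votes)
        actions = {a for (p, a) in keys if p == problem_type}
        return sorted(actions, key=lambda a: self.score(problem_type, a), reverse=True)


advisor = RemediationAdvisor()
advisor.record_outcome("dos_attack", "block_ports", solved=True)
advisor.record_outcome("dos_attack", "block_ports", solved=True)
advisor.record_outcome("dos_attack", "restart_firewall", solved=False)
print(advisor.recommend("dos_attack"))
```

The point of keeping outcomes and votes separate is that the system can start from historical logs alone and then be corrected by the experts while they are still around to be asked.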
Now let's start to look at the logs. Did my response time improve? Has performance gone back into that nice dark green band again, or didn't it? Start to look at the outcomes and learn from the outcomes, again to get more intelligence back into the system. You continue to use that feedback loop to make the system smarter: understanding the outcomes, which ones worked, so that we know to use them in the future, and coupling that, of course, with what we learn from the experts.
Let me start to summarize. I told you up front about the maturity model and where this thing goes, and then went into detail on some of the areas, how we do that and what we're doing. One of my old professors at university said: tell them what you're going to tell them, tell them, and then tell them what you just told them. So this is what I just told you. One of the key pieces is correlation: basically gathering all this data from multiple different domains (networking, storage, et cetera) and then, based on that information, starting to get some of the insights, the cross-domain dependencies, et cetera. Then apply context. Looking at some of the historical data that we've collected in the past is one part of the context, for example events that may occur at different times of year, and then there's the context from other domains. So we're starting to bring all this together so that we don't just have a whole bunch of data that we get partial insights into; we get some real context to base it all on. And then, of course, drive action. And one of the key pieces is to learn from the action: did the action actually do something?
Let me just wrap it up now. Like I said at the beginning, this whole area is so complex. Managing IT and understanding where problems are is costly, right? Recovery in four and a half hours is pretty painful when you're losing millions of dollars per hour on certain applications. We worked with a partner here to look at this type of solution that we're delivering, and it is driving results. You can see some of the results I've got here: much faster mean time to recovery, around 80%, and faster handling of production problems. You can see the results up here, but in general it is significantly improving the productivity of employees. Our employees are experts in these particular areas, and it's about getting back up and running much faster, and understanding and gathering a lot of that tribal knowledge from our experts who may be retiring.
So let me pause there. First of all, thank you very much for your attendance, and let me see if we've got any questions.
On the earlier part, the maturity model: what about helping us find out where the problem is not, as opposed to where it is, which seems like it's an easier problem
than finding where it is?
Yeah. Well, if you remember the image I had with the topology maps, this is where we're starting to show that. When I start to correlate different data together, one of the examples I gave was on network performance versus fault, where you were looking at performance. And the reason performance was looking bad was that everything was going through that one router right there, because another router had broken, and the real problem was over there. The idea is using machine learning to understand when everything's starting to go through that router: I'm starting to see much greater traffic than I ever saw before. So this router is actually working and doing a good job; it's just handling much greater traffic, and I have to start looking elsewhere in the system. So it's still looking at: are you operating within your guidelines and everything else? And if you are, and you're handling things, then the system is gathering data from other domains to try and see: everything looks fine here, but I'm starting to notice we've got other problems, and other domains are actually causing this. So it's going to start to look beyond where you're thinking the problem is, because we're correlating from other domains, and it's pointing to the real root cause being over here, not here, right?
I've been in many escalations involving a great many experts. And there's an approach that I call spin-the-bottle debugging: you guess at it, and now the storage people have to prove that it's not their problem, right? But at any given moment, if I look across the fleet, I can probably say: if storage is giving me a green light, then it's not them. If the network is maybe not green: you stay here. Those four apps: why are you here? Don't waste your time, you're green. It seems to me to be much easier to bisect the problem by saying who's good, "you go away first", and I could use some help doing that kind of thing.
I mean, if everyone has a green light on, I guess that would be easy, but determining which one it is really comes down to the effectiveness of the root cause analysis. Because we've got 14 different symptoms and everyone's saying, "Okay, you're one of the symptoms, you're the problem." And that's the power of the root cause analysis, because that's where you start to say: okay, now I'm starting to see where the root of the problem actually is. So it's not focusing on where the problem is not. I'm trying to find out where it is, and the root cause analysis is also the part that tells me where the problem is not, by separating that root cause from just a whole bunch of symptoms. So I guess the answer to your question is that having good root cause analysis is what eliminates that "where is the problem not" piece.
How do you automate root cause analysis?
It's not really necessarily an automation area as such; it's a data analysis, data insight area, right? So the idea is to look at the events that happened at the same time.
And then a human is looking at it?
Absolutely, yes. But the idea is to try and get rid of a lot of the useless data, to decide what is and isn't noise. What's critical is correlating across different domains and applying the context, right? And then, at that point, starting to see and understand topologies and dependencies, so that the machine understands all this. That's now helping to say: okay, I understand that I have a problem here, this is not responding, but it's dependent on everything else. I'm starting to get to the root cause. But it's really about bringing this information up. You're still going to have experts who are saying, "Yeah, you're right" or "No, you're wrong" and stuff like that. Over time, that will improve, right? But yeah.
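As a rough illustration of how topology and dependencies help separate a root cause from its symptoms, here is a minimal sketch. It assumes a hand-written dependency map and a simple heuristic that is not taken from the talk: a component showing symptoms is treated as a root-cause candidate only if none of the components it directly depends on are also showing symptoms. The component names and the `root_cause_candidates` function are hypothetical.

```python
def root_cause_candidates(depends_on, symptomatic):
    """Return symptomatic components whose direct dependencies all look healthy.

    depends_on: dict mapping a component to the components it depends on.
    symptomatic: set of components currently showing symptoms.
    """
    candidates = []
    for component in symptomatic:
        upstream = depends_on.get(component, [])
        # If something this component relies on is also unhealthy, this
        # component is more likely a symptom than the cause.
        if not any(dep in symptomatic for dep in upstream):
            candidates.append(component)
    return sorted(candidates)


# Hypothetical cross-domain topology: application, database, network, storage.
depends_on = {
    "web_app": ["database", "router_a"],
    "database": ["storage_array"],
    "router_a": [],
    "storage_array": [],
}

# Three components raise alerts, but only one has no unhealthy dependency.
print(root_cause_candidates(depends_on, {"web_app", "database", "storage_array"}))
```

A real system would work from discovered topology and correlated events rather than a hand-written map, and, as the talk notes, an expert would still confirm or reject the suggestion.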