About the talk
AI reliability has yet to be widely considered in data science, yet it is required to predict how a model will perform over time in the field. What exactly constitutes AI reliability, how do we measure and maximize it, and why is it needed? This session will answer these questions, as well as present mathematical models for measuring and optimizing it.
Dr. Celeste Fralick has nearly 40 years of data science, statistical, and architectural experience in eight different market segments. Currently the Chief Data Scientist and Senior Principal Engineer for McAfee, Dr. Fralick has developed many AI models to detect ransomware and other security threats. She leads the McAfee Analytic Center of Excellence to enable Agile model development and eight Community of Practice groups world-wide. She has chaired numerous global engineering bodies, served on editorial boards, developed countless standards, and led corporate-wide process and product development efforts at Intel, Medtronic, Fairchild, and Texas Instruments. Dr. Fralick received her Ph.D. from Arizona State University in Biomedical Engineering, focused on Deep Learning and neuroscience.
Hi, my name is Celeste Fralick. I'm the Chief Data Scientist and Senior Principal Engineer of McAfee. Today, as the agenda on this slide summarizes, I hope the audience will walk away with old and new ideas about AI reliability and begin to put in place some key monitors in DevOps and MLOps. These monitors are the baseline, the foundation, of AI reliability. I will also give you some proposed mathematical models for AI reliability, and we will end with the proposal of an important AI reliability term I want to introduce today, called mean time to decay,
or MTTD. So let me walk you through how to get there. To start at the beginning: there are three words that have become very prevalent when speaking about AI. The first one is robustness, and it has a whole bunch of synonyms. Typically it has referred to the ability of a computer system to cope with errors during execution and with erroneous input. We can extend that concept to AI: how effective is your algorithm
while being tested on a new, independent, but similar data set? This property is also known as algorithmic stability, and in fact, in many cases we minimize an error rate, such as mean squared error, or maximize peak signal-to-noise ratio, to ensure a stable algorithm. The second word is resilience: if inputs ebb and flow, does the model still perform the same way? Resilience also implies toughness (I like to think I'm very resilient) and the ability to adapt to risks and continue operation
of core business functions despite challenges. The last word is trustworthiness, which the Boy Scouts have used for many years and which is now embraced by NIST, the National Institute of Standards and Technology, in the United States. These three words are critical to describe what we expect out of AI, and we can use them to form the basis of AI reliability. Reliability is the intended performance over time. Performance can be accuracy, it can be false positives or AUC, or whatever your business states is
important for you, for the business, and for the customer. Our three common words combine to support AI reliability, and they ensure reliability can be measured via the monitors each word represents. AI reliability can answer the questions: How stable is your algorithm? Does your model have the ability to adapt to risks? Can you adopt AI without fear? That is where bias, adversarial machine learning, and ethics come into play. So these three words combine to
form AI reliability, and they beg the question, most importantly: how long will my model last in the field? The additional reasons for AI reliability are numerous, and I'm listing them here. For your car or your airplane, reliability provides a sense of security and certain expectations, so reliability can be critical to us as customers. But the company CEO is concerned about growth, about competition, and about customer satisfaction. AI has been a revitalized technology since the term was coined by John
McCarthy in 1956, and it's timely, I'd say probably required, right now to assess AI's reliability. Now, reliability mathematics has been around for a long time. It's applied to everything from our cars, our refrigerators, and our computer chips down to the very clothes we wear. It is a manufacturer's prediction of how long a product will perform its intended function. It's the best-by date on your milk; it's the recommended cleaning frequency for your teeth from your dentist. Graphically, reliability is often drawn as a bathtub curve, which starts
with infant mortality and a decreasing failure rate; this region is known as quality, just like the great car that you bought that works just great. Then there's a happy place known as normal life, or useful life, where you have a low, constant failure rate. And then pretty soon you have increasing calls to the repairman, and you reach an end-of-life wear-out period with an increasing failure rate, when your milk starts to sour and things like that. So the general definition of reliability embraces mathematical models that build this bathtub
curve over the years. Software reliability also exists; it evolved from hardware reliability over the years. Let me quote, because I think it's an incredible quote: software reliability is "an estimate of the level of business risk and the likelihood of potential application failures and defects the application will experience when placed in operation." A great definition of software reliability. We're very familiar with the fact that it can utilize errors detected at each stage of development; it's even using neural networks and other AI tools to assess and predict failures.
Software reliability still addresses those core items, such as functional and structural quality, typically in the software quality assurance organization, as well as the old standbys that I mentioned. They also use combinatorial tests designed to identify the optimum number of tests needed for the coverage desired. Microsoft has done a great job of creating some reliability measurements, such as the Office Customer Experience Improvement Program, or
CEIP, and the Microsoft Reliability Analysis Service, MRAS. You can find details of these acronyms and other Microsoft programs within the paper here. Now, I'm sure you're all familiar with the ISO software quality model and its reliability characteristic, but what I want to point out, and what I've highlighted in yellow there, is the fact that it also calls out those three important words, resiliency, stability (or robustness), and monitoring, and monitoring is so important to the critical concept of
reliability. However, both software reliability and the general reliability equations of the bathtub curve have yet to address the reliability of AI, and I don't think AI reliability fits neatly into either one of them, though we can certainly utilize some of their calculations. So a new paradigm needs to be considered so that we can predict the life of a model in the field. Consider, then, these questions about AI reliability: How do you predict the reliability of an AI model? How do you quantify robustness, resilience, and trustworthiness? What are the factors or features that drive reliability, and
how do they change? And how does AI reliability change over time? To give some guidance in answering these questions, we must look at the entirety of AI development and deployment in the field, from soup to nuts. Now, most developers are anxious to create the latest, most exciting model; I know I am. And as developers soar ahead, DevOps, or development operations, has been a mainstay of developing code, as well as models, for years and years. It's necessary, but AI is more than just software code: it is based on curated data. It requires
targeted actions built in from the beginning of and during model development, which must also be monitored, and you also have to consider the deployment of the model in the field with your customers. So implementing monitors at the beginning, during development, and while in the field constitutes the typical organizational structure of DevOps and MLOps. The picture here on the right is just an example; yours may be more or less delineated, or skewed, most likely toward
DevOps, as most are. What we need to do is monitor design, and checks and balances, as well as field performance. AI is moving from development-centric to consumption-centric activities, and we find that, just as there is DevOps, there is now MLOps. There's a 3 to 15% profit margin increase from implementing MLOps, which includes model monitoring, there in the lower right. In fact, some believe the MLOps market will be around four billion dollars by 2025, which is only four years away.
So it's important that you look at those DevOps and MLOps monitors as well as field performance. Now, to review what we've discussed: consider that robustness, resilience, and trustworthiness provide input into monitors that can be implemented in DevOps, in MLOps, or in the field. We ensure that there is always a feedback loop, as you see at the top there, and we want to make sure that feedback loop goes back into DevOps and MLOps to place alerts, revisions, and thresholds for root cause analysis. I've
referred to the term monitors throughout the talk so far, so let's take a look at some examples. These next two slides may be a lot to take in, but what I want to do is offer suggestions on monitors. What I've done in these next few slides is give you an idea of some of the monitors that can be placed in MLOps or DevOps. You should all be familiar with some of these, but what I want to do is make
sure that you look at these from a prioritized standpoint. Start with low-hanging fruit, such as skewness, kurtosis, and variance for concept drift; volumes and types; downsampling, storage, and your caches; and your error rates and ROC curves over a set period of time. These in and of themselves can be a good place to start. The second page lists additional monitors that you might consider: adversarial machine learning, or model hacking, where poisoning or evasion can drive up your false positives or false negatives; and bias, which certainly every one of us has.
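As a minimal, hypothetical sketch of those low-hanging-fruit monitors (the function names, the 25% tolerance, and the alerting scheme are my own inventions, not from the talk), a production window can be compared against a training baseline on variance, skewness, and kurtosis:

```python
def moment_stats(xs):
    """Return (variance, skewness, kurtosis) of a non-constant sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n
    return var, skew, kurt


def drift_alerts(baseline, window, rel_tol=0.25):
    """Flag any statistic that moved more than rel_tol from its baseline."""
    names = ("variance", "skewness", "kurtosis")
    alerts = []
    for name, b, w in zip(names, moment_stats(baseline), moment_stats(window)):
        denom = abs(b) if abs(b) > 1e-9 else 1e-9  # guard near-zero baselines
        if abs(w - b) / denom > rel_tol:
            alerts.append(name)
    return alerts
```

In practice you would baseline these statistics during model validation, evaluate each production window, and feed any alerts back into the DevOps/MLOps feedback loop discussed above.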
We want to make sure that we consider not just sampling, measurement, and algorithmic bias, but also prejudicial bias. With prejudicial bias, remember that we have societal changes, such as the racial unrest in the US, so your bias may become much more sensitive for a period of time. How does your AI model react to that? You have to keep some of these societal concerns in mind as well. Then there is certainly explainability, where you can implement explainability by design with some of the
LIME and SHAP monitors, and interpretability, which is not very mature yet but implies that you put UI and UX principles first, a user interface that allows people to be part of the decision-making; AI explainability helps with that quite a bit as well. Certainly anomalies matter, and for this crowd, cybersecurity obviously is very important: the number and type of malware, including exclusive detections, as well as the false positives,
and whether the threat families have changed over time. So after you've finished looking at these monitors, and you've selected them, baselined them, and measured them, you have a model that has an expected lifetime, noted by the red line there, until it starts to decay, noted here as a solid black line. When retraining occurs because of the decay, the model is refreshed, as noted by the red arrow, and measured against the business goal. In this graph I've selected accuracy, seen on the y-axis; yours may be different.
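To make the retraining trigger concrete, here is a toy sketch (the accuracy history, the 0.88 goal, and the function name are invented for illustration): track the business metric over time and report the first time it decays below the goal.

```python
def first_decay_time(measurements, goal):
    """measurements: (time, metric) pairs in time order, e.g. accuracy by day.

    Returns the first time the metric falls below the business goal
    (i.e. when retraining should be triggered), or None if the model
    has not yet decayed.
    """
    for t, metric in measurements:
        if metric < goal:
            return t
    return None


# Hypothetical accuracy history: the model goes live at day 0 and slowly decays.
history = [(0, 0.95), (30, 0.94), (60, 0.90), (90, 0.86)]
print(first_decay_time(history, goal=0.88))  # 90: retrain around day 90
```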
The decay causes, there in the rectangles (blue, green, pink, and purplish), are highly dependent on what monitors you have put in place, such as those listed in the last two slides. They may be single causes or cumulative causes, and they may have different units of measure, so you need to consider each one of them. Now, you may think that's probably it and you can go off to your calculations, but of course we have to have the
math. First of all, we want to calculate the instantaneous decay rate of our model. We want to consider the conditional probability that B will occur given that A has occurred. Now, that's just for two causes; you may perhaps have a couple of dozen causes, we don't know, because it's based on what you've monitored. Your probability calculations will then be converted to a rate, and after you place them into a function of time, you can
let the change in time approach zero and obtain a derivative, the decay rate of the model, in equation three. And lastly, you'll get an instantaneous failure rate in equation four. However, it may be more useful for you to consider a single average number, or even a cumulative decay rate, of the original model. So here I've provided one, by integrating the instantaneous failure rate over the interval and dividing by the time difference. Now, the time differences
may be different for you: they might be weekly, they might be daily, they might be monthly. Time zero might be when you go live, time one might be 60 days, time two might be 62 days, time three might be, you know, 80 days; it really depends. So at this point you're calculating your average decay rate, and this is all based on the first four equations that I showed you. By integrating the instantaneous failure rate over the interval and dividing by the
time difference, you can achieve an ADR, or average decay rate, in equation six. Sorry, just checking my notes here. As the monitored changes are measured, you may find that not much of a decay rate is required before you should be alerted: a small fraction of a change in a few monitors may be enough to signal that the model has decayed perilously. This was well documented in the 1986 Challenger explosion disaster, where an O-ring caused a fatal error due to cumulative failures. Unfortunately, there were twelve O-rings, and all
had to function at 0.99, or 99%, reliability. The system reliability therefore was 0.99 to the twelfth power, or about 0.89: a potential failure rate of 0.11, roughly one out of nine instances, and the explosion occurred on the 12th shuttle launch. These calculations are based on general reliability theory; typically, software checks and balances can be added to the cumulative probability equation. You should also consider whether the change in one monitor is independent of the others or
dependent on the others. Now, what we want to get to is mean time to decay. The probability density function is related to the CDF, the cumulative distribution function, by an integration equation that I haven't shown, but it's in most reliability books, and it can be related to the monitor failure rates with equation seven here. It helps to specify the distribution of multivariate random variables, and therefore it's great at reflecting the monitors we have in place. Now, mean time to decay is what we're looking for.
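The talk's equations one through seven are not reproduced on these slides, so here is only a hedged numerical sketch of the averaging idea (my own illustration, not the speaker's exact formulas): integrate an assumed instantaneous decay rate h(t) over the interval, divide by the time difference to get the average decay rate (ADR), and take its reciprocal as a mean-time-to-decay estimate.

```python
def average_decay_rate(h, t0, t1, steps=1000):
    """Trapezoidal-rule integral of h over [t0, t1], divided by (t1 - t0)."""
    dt = (t1 - t0) / steps
    total = 0.5 * (h(t0) + h(t1))  # endpoint terms
    for i in range(1, steps):
        total += h(t0 + i * dt)  # interior terms
    return total * dt / (t1 - t0)


def mean_time_to_decay(adr):
    """MTTD as the reciprocal of the average decay rate."""
    return 1.0 / adr


# Assumed constant decay rate of 1% per day over a 60-day window:
adr = average_decay_rate(lambda t: 0.01, 0, 60)
print(mean_time_to_decay(adr))  # ~100 days
```

With a constant rate the averaging is trivial; the integration only matters when the monitored rate h(t) genuinely varies over the interval.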
Lambda is the failure rate, and since the instantaneous failure rate h(t) varies over time, we define a single average number that reflects behavior over a specific interval. That number can be reported to analysts, customers, competitors, and investors to demonstrate the robustness, resilience, and trustworthiness of an AI model; you can see how I tie that back to those three words. All right, to summarize. Defining and implementing your monitors will be the hardest challenge; that's definitely the hardest part. The math is simpler. To select the monitors, you can utilize the two slides that I provided for you here, or you can review the NIST and ISO documents for added insight into monitors and into robustness, resilience, and trustworthiness. Identify and improve current monitors in your DevOps, your MLOps, and in the field, and make sure that you have a feedback loop. Once monitors are implemented, you can collect metrics, such as the change in units per time, and utilize these monitor metrics within your reliability models. You can calculate instantaneous, cumulative, and average failure rates, and you can communicate a reportable and very public mean time to decay. Now, while we have all enjoyed developing cool new models since AI has resurfaced, it is important for all of us to consider the long-term implications, improve our processes, and set our sights on AI reliability.
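Finally, the Challenger-style cumulative arithmetic mentioned earlier is easy to reproduce. This sketch assumes a series system of independent components (or monitors) that must all hold; the twelve-components-at-0.99 numbers are for illustration:

```python
import math


def system_reliability(component_reliabilities):
    """Series-system reliability under independence: the product of the
    individual reliabilities, since every component must work."""
    return math.prod(component_reliabilities)


# Twelve components, each 99% reliable, yield only ~89% system reliability,
# i.e. a potential failure rate of roughly 0.11.
print(round(system_reliability([0.99] * 12), 2))  # 0.89
```

The same multiplication shows why many individually excellent monitors can still add up to a worryingly low cumulative reliability.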