About the talk
IBM has a long history of improving the reliability and availability of systems ranging from the largest of mainframes to the smallest of microservices. As part of cultural and organisational improvements, we sat down and codified a list of Chaos Engineering principles which define our view of Chaos Engineering. These principles do not replace existing principles, but adapt them and match them to the requirements we have from our clients and from our own internal services. In this session, we will describe a little of the process of getting engineers from across IBM to agree on these principles, and present the principles and lessons we agreed upon.
Trilingual, confident, and tech savvy, I’m known to be the technical conscience of my team. Focused on enterprise architecture for mission critical services, I have a reputation for rightly orchestrating and delivering IT modernization programs covering apps, platforms and infrastructure including the grey area in between. I work with architects, developers and operation teams to introduce new availability patterns and cloud native architectures covering Kubernetes, Site Reliability Engineering (SRE), Chaos Engineering, and Service Resiliency in Multi-Availability Regions.
So, good morning, good afternoon, good evening, and welcome to Chaos [inaudible]. My name is [inaudible], and I'm the Chief Architect for the IBM Always On practice. Basically, my job is to help a large audience think differently about availability, resiliency, and reliability. What I want to do today is give you a view of how IBM is modernizing reliability, and how chaos engineering plays a very important role in giving us evidence and confidence that we are going to be able to meet our service level agreements. A little bit of context first: IBM is 110 years old.
We run and manage the backbone of the world economy. That's no exaggeration. We design, build, and manage 40,000 mission-critical workloads that handle billions and billions of financial transactions every single year. Of course, I don't need to tell you how critical those applications are. Most, if not all, of our clients are regulated. This is a tough environment in which to come and adopt new technologies and new reliability practices, right? So we have to be very, very careful.
Of course, this involves some challenges. Like I mentioned, most of our clients are regulated: transportation, financial services, retail, healthcare, and so on. Therefore change is not always easy. Even though most of our clients are transforming and modernizing, they're doing it at their own pace and by being a little bit more careful. One reason they're careful is that they have a lot of technical debt. Many of their applications were developed 20, 25, even 35 years ago, and nobody really understands them anymore. There are also many applications that were built by vendors that no longer exist, and there was no investment or appetite to transform or modernize those applications. Those are also challenges we have when it comes to skills when we work with our clients. And then there is culture: many of the people have been in their jobs for tens of years, so it's not very easy to just come and say, "Hey guys, there's something called chaos engineering that you should do." And for many of those applications, hope is not a strategy, right?
The way we come and try to change the mindset is by admitting that basically everything breaks. This is an important message: accepting that everything breaks and that we should plan for it. By planning for it, we will know how we're going to behave as an organization, how our ops and our SREs will behave, and how the application itself will behave.
Now, before I talk about chaos engineering: when we re-engineer or re-architect some of their applications to be resilient, to be reliable, to achieve those five nines that they require, I always say that the application must be able to withstand the Fs: your fires, your floods, your fools, and your fat fingers. Let's look at them very quickly. Your fires are your component failures. This is when a switch, a router, a firewall, whatever, fails. Your floods are your physical failures. This is when your links, your region, or your data center fail. We call those the floods. Your fools are important, because when you look at the majority of causes of outages in the world and with our clients, just so you know, it's mainly software, applications, and humans. Your fools are what I refer to as sneaky SREs or sneaky engineers who are not following the change process properly. They go to prod, do something, and that change causes an outage. Those are your fools. And finally, your fat fingers are your SREs who are not sneaky. They're following the rules very, very well, but for some reason the command they launched that day had an extra zero or a syntax error, and that caused the entire control plane to fail. So again: your fires, your floods, your fools, and your fat fingers.
Of course, this means that when we architect the application, when we re-engineer it, we have to take all of these Fs into consideration. How can you build a multi-active, multi-region, multi-AZ environment that can survive failures in the underlying cloud and the underlying infrastructure? That's what we do. But again, hope is not a strategy. We need to understand the behavior of our applications, and chaos engineering helps us with that.
It's something that we've been doing for a long, long time. It's very similar to site reliability engineering: we at IBM have been doing site reliability engineering for the past 20 years, but we didn't call it site reliability engineering. The same goes for chaos engineering. We only recently adopted the term chaos engineering; we used to call it failure-oriented testing. That's how we tested our services. Of course, that was not in the cloud world, and not in a cloud-native way. That's why we rely on what the industry, everybody on this call, has built to promote chaos engineering, and we build upon that. We built our own principles, IBM's principles of chaos engineering, and also IBM's methodology for applying chaos engineering safely.
Let's go on to the principles of chaos engineering. I have to remind you that this is just a sneak peek. There's a lot of documentation behind everything I'm going to show you, but it's still important for you to know what IBM is actually doing in that space.
Of course, the first principle is to strengthen the current reliability disciplines that are already in place. It's not a good idea, and it won't work, if you don't have a certain foundation of business continuity and resiliency and you then apply chaos engineering on top of it. You have to have a baseline to build upon. It also takes awareness that resiliency is not a global thing anymore. Many clients see resiliency as a global feature: an entire data center that has a disaster recovery site, storage replication for everything. That doesn't work in a cloud world, and it doesn't work in a cloud-native way. So what we're trying to say here is: build upon the currently available reliability disciplines.
Again, in this world you have a lot of operations teams. Normally, the functional requirements are the responsibility of the developers, and the non-functional requirements, such as availability and resiliency, are the responsibility of the ops teams or the SREs. That doesn't work anymore. They have to come together and understand the business service. We use the term business service instead of application or infrastructure. A business service is built upon applications, middleware, brokers, Kafka, storage, clouds, and so on. Understanding the critical path of a transaction is very, very important in order to understand flaws and to start thinking about experimentation: where can I inject failures? That brings us to the next principle: experiment on every component.
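To make the idea of a business service and its critical path concrete, here is a minimal sketch in Python. The dependency map, the component names, and the function `critical_path` are hypothetical illustrations and are not taken from IBM's documentation. The only point is that once the dependencies of a transaction are written down, every component the transaction can touch becomes a candidate for an experiment.

```python
from typing import Dict, List

# Hypothetical dependency map of one business service: each component
# points to the components it calls while serving a transaction.
SERVICE_MAP: Dict[str, List[str]] = {
    "web-frontend": ["payments-api"],
    "payments-api": ["message-broker", "ledger-db"],
    "message-broker": ["fraud-check"],
    "fraud-check": [],
    "ledger-db": ["block-storage"],
    "block-storage": [],
}


def critical_path(service_map: Dict[str, List[str]], entry: str) -> List[str]:
    """Return every component a transaction entering at `entry` can touch.

    Depth-first walk; each component is listed once, in visit order.
    """
    seen: List[str] = []
    stack = [entry]
    while stack:
        component = stack.pop()
        if component in seen:
            continue
        seen.append(component)
        # Reverse so that children are visited in their declared order.
        stack.extend(reversed(service_map.get(component, [])))
    return seen


if __name__ == "__main__":
    for component in critical_path(SERVICE_MAP, "web-frontend"):
        print(f"candidate for failure injection: {component}")
```

A real service map would come from architecture documentation or tracing data rather than a hand-written dictionary; the walk itself stays the same.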
Chaos engineering is not only infrastructure-oriented. Doing chaos engineering just by adding latency or removing a volume is chaos engineering at the infrastructure level. That's a great way to start, but we think that chaos engineering, or failure injection, should be done end to end. We have to take our time to understand the transaction end to end, and apply chaos engineering on every component when possible.
Strive for production. Regulations don't allow you to do just about anything, right? So we have to be very careful in how we promote chaos engineering in production. This is why we use the terminology "strive". It's good to start in dev, it's good to start in a contained environment, but we believe that chaos engineering has more value when it's done in production. So apply it gradually in production, once you have more and more confidence and traction.
Once you start to understand the system, you can figure out how to contain the impact: how to set proper fallbacks, how to set your own kill switches, and so on and so forth. That's very similar to the industry terminology, which is that you have to minimize the blast radius. For us, what our end users will see is a business impact. They're not going to see a blast radius; they're going to see a business impact.
Measure, learn, and improve. Measuring, of course, is about observability, and this is very, very important. Learning is all about going back to what we thought would happen, how we thought the application or the business service would behave, and proving or disproving that hypothesis. And when we do understand the behavior, it's time for us to improve. Improvement can mean many things. It can mean that an SRE can do something to enhance a certain fallback mechanism, or it can mean that we have to raise a new feature request or a bug against the application because it needs significant development work.
Increase complexity gradually. This is similar to containing the impact, except that in real life you're going to have multiple failures happening at the same time. If the architecture of the application and the entire business service is properly engineered, then if failures happen at the same time, you should be okay. So what we're trying to say is: gradually increase the complexity of your failures. An example of this is injecting multiple failures at the same time. That's not a bad idea. Not in the beginning, of course, but as you gain more and more confidence it becomes very important.
Finally, and this is one of the most important things: socialize. You need to socialize chaos engineering. You need to get buy-in by conveying the message that it's about balancing short-term risks for longer-term reliability. The more education you provide about chaos engineering, the more support you get from your management, and for us that's very important. Again, in regulated industries socializing is very important: what are the benefits, what is the value, what are the risks? These are things they need to be aware of, and they do appreciate that. So they are more ready to open up when we talk about chaos engineering.
Of course, this is just a sneak peek. We have a lot of documentation to explain every single one of those principles. Now that we've talked about the what, let's talk about the how: how do you start, and how do organizations do chaos engineering?
We built our own methodology. IBM has a large consultancy, we have methodologies, and we built one specifically to tackle reliability. Here's a sneak peek into IBM's chaos engineering methodology. I'll summarize the major points; every point has a lot of activities that need to be done.
Like I said, once you are convinced, you need the buy-in from the organization to do chaos engineering. But sometimes you need to understand the system first, to identify flaws and issues, before you go and get organizational agreement. Understanding the system also means studying outages, post-mortems, transactions, and so on and so forth. That's going to make you more aware of where to start. If I think that a certain fallback mechanism is not working, then we start there.
Gaining organizational agreement can come before or after that. Once you are convinced that a certain flaw exists in your system, and you want to prove or disprove it using chaos engineering, you must bring in the organization. Everybody must be aware of what you are going to do, what the risks are, and what the benefits are. Of course, socializing helps you a lot.
Then you create a hypothesis and the experiments. Just so you know what this is all about: we call them experiments instead of failure injections or attacks, because with an experiment a hypothesis will be proven or disproven. This can only come after we understand the system very well.
Observability: I haven't covered it in detail, but it is a very important topic. How can I measure, how can I understand the behavior of my application as I experiment on it? I need to make sure that I have the right observability tools in place.
Prepare the experiments. That can mean many things. If you're starting with a small team, you can run informal game days. But in a regulated environment you can't escape being strict and very well organized. That means you have to bring in every stakeholder and make sure they know what they're supposed to do: who is the chaos commander, who is the person with the authority to apply the kill switch or to roll back the entire experiment. And then, of course, scripting whatever failure injection you want to do is also part of preparing the experiment.
Running the chaos experiment is actually running a script, or whatever tool is used, to submit that experiment.
Once your experiment has run, and hopefully it was contained the right way and you didn't have a big impact on the business, it's time for you to analyze the results and see whether you have proven or disproven your hypothesis. Are your fallback mechanisms in place? Are your reliability components, your circuit breakers, your load balancers, your traffic managers, all working the right way? Are your retries working, and so on and so forth?
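The steps described here (hypothesis, injection, kill switch, rollback, analysis) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Experiment` class, the `run` function, and all hook names are hypothetical and do not come from IBM's methodology or from any chaos engineering tool. The health check and the kill switch are stand-ins for real observability and for the decision of whoever holds the authority to abort.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """One chaos experiment: a hypothesis plus the hooks needed to run it."""
    hypothesis: str
    inject: Callable[[], None]        # starts the failure
    rollback: Callable[[], None]      # undoes the failure
    steady_state: Callable[[], bool]  # True while the service looks healthy
    kill_switch: Callable[[], bool]   # True when the commander aborts
    checks: int = 5                   # how many times to observe the system
    log: List[str] = field(default_factory=list)


def run(experiment: Experiment) -> str:
    """Run the experiment and return 'proven', 'disproven', or 'aborted'."""
    if not experiment.steady_state():
        return "aborted"  # never inject into a system that is already unhealthy
    experiment.inject()
    experiment.log.append("injected")
    outcome = "proven"
    try:
        for _ in range(experiment.checks):
            if experiment.kill_switch():
                outcome = "aborted"
                break
            if not experiment.steady_state():
                outcome = "disproven"
                break
    finally:
        # Roll back no matter how the loop ended, to contain the impact.
        experiment.rollback()
        experiment.log.append("rolled back")
    return outcome


if __name__ == "__main__":
    state = {"broker_up": True}
    exp = Experiment(
        hypothesis="Orders still complete when the message broker is down",
        inject=lambda: state.update(broker_up=False),
        rollback=lambda: state.update(broker_up=True),
        steady_state=lambda: True,  # stand-in for a real health check
        kill_switch=lambda: False,  # stand-in for the commander's abort signal
    )
    print(exp.hypothesis, "->", run(exp))
```

The rollback sits in a `finally` clause so that the failure is undone whether the hypothesis holds, fails, or the run is aborted. That is one simple way to express the idea of containing the impact in code.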
Whether you prove or disprove it, you need to communicate your findings, because I'm pretty sure you're going to find some very interesting results. Some of those results are things that you can fix by yourself. For others, you're going to have to raise a service request, a bug report, or a feature request against another team. So you need to communicate your findings: bring all of the stakeholders into one place and improve the system.
Then you iterate. We all know this, and it is very important: you have to start small and grow. Growing your blast radius means you must touch other dependencies, attack other dependencies, and try again. Because you're dealing with business services that are mission critical, you have to enlarge the surface of your experiments. That's very, very important because, like I said, the value and the benefits only happen when you slowly and gradually expand, and then start everything all over again.
I hope that I was able to give you, in only 20 minutes, a short sneak peek of what we're trying to do and what we're doing with our most mission-critical workloads. I invite you all to look at some of the things IBM has been doing with chaos engineering.