About the talk
When applied to Cyber Security, Chaos Engineering is advancing our ability to reveal objective information about the effectiveness of operational security measures proactively through empirical experimentation. In this session we will introduce the core concepts behind this new technique and how you can get started in building and applying it.
Jamie Dicken is the Director of Security Assurance at Resilience, a premier biotech company changing the way next-generation medications are made. There, she is responsible for GRC, audit, continuous security control validation, and security awareness. Prior to that, Jamie built and led a continuous security validation team at Cardinal Health. She also has a decade of experience as a software engineer and technical manager at two Fortune 15 healthcare companies, where she focused on designing, building, and delivering new features to the market. Her professional passions include leading high-performing teams, executing on high-profile strategic initiatives, implementing continuous compliance at scale, championing employee growth and development, and mentoring.
Aaron Rinehart is expanding the possibilities of chaos engineering to cybersecurity. He began pioneering security in chaos engineering as Chief Security Architect at UnitedHealth Group (UHG). While at UHG, Rinehart released ChaoSlingr. Rinehart recently founded a chaos engineering startup called Verica with Casey Rosenthal from Netflix and is a frequent author, consultant, and speaker in the space.
Hello everybody, and welcome to our session, "Navigating the Unknowable: Creating Resilience with Security Chaos Engineering." We are so happy to be here today. We want to thank RSA for highlighting our work within resilience and security chaos engineering, and for having us be a part of their RSA 365 virtual series. It is truly, truly an honor. My name is Jamie Dicken, and I'm the manager of applied security at Cardinal Health. I lead a team that focuses on continuous security validation and security chaos engineering, and my co-speaker today is the one and only Aaron Rinehart. He is the CTO of Verica. He is one of the pioneers of security chaos engineering. He was the leader of ChaoSlingr back at UnitedHealth Group, and he is actually the co-author of an O'Reilly report on security chaos engineering.
Like I said, we are just so excited to be here, and we have an awesome talk lined up for you today. We're going to share how we believe that security chaos engineering is actually the solution to a lot of the engineering challenges we've had over the past couple of decades. We will also give you some real-world examples of how we leverage security chaos engineering at our companies, and tell you how you can go ahead and get started as well.
So whatever your area of expertise, I think it's safe to say that we can all align on one basic truth, and that's that systems engineering is messy. In the beginning, we start with these beautifully simplistic representations of either what we want to build or what we think that we did build. But it doesn't take long for us to sour on those initially perfect creations, and it's not always our fault. It's just that complexity has a way of sneaking into our systems. As time goes on, our problems seem to compound, and we get to the point where we recognize that what we're doing is fighting battle after battle, and that what we need is a radical new way to secure and stabilize our systems if we're going to get ahead.
The problem that we've been facing is that the old-school approach to both security and site reliability was design-oriented. In theory, if you wanted to assess the system, you would go and take a look at things like single points of failure. You'd pull up your infrastructure diagrams and identify where latency could possibly occur and which components can interact with each other. But we all know that there are inherent problems with that. First of all, think about when this documentation is created. If it's created prior to deployment, then our documentation might not even reflect the system that went into production. If the documentation was created afterwards, it's still dependent on the lens of the person who created that documentation and the memory that they have. Then you've got to figure: is it updated? And if it's updated, does it still show every point of integration? Does it show all of the downstream effects?
The challenge that we have is that no matter how we approach this, our process fails, because it assumes that the documented representation of the system is actually correct. And I'm sure that a lot of us as engineers have seen plenty of times where that assumption doesn't hold. If your process of evaluating a system is dependent on an outdated or a flat-out incorrect representation of that system, whether you misremember it or you didn't know all of the details from the start, your evaluation of that system is ultimately going to fail, and there are going to be plenty of problems that you don't anticipate.
It was complexity scientist David Snowden who told us that it's actually impossible for a human to model, document, or diagram a complex system. The only way to understand it is to actually interact with it. So the answer really is not to rely on our own recollection of our systems, but rather to learn from them and to use empirical data. And that's really what's at the heart of chaos engineering: implementing experiments that clearly remind us what our real system landscape looks like, so we can start to tease out those false assumptions. So Aaron, I'm going to turn it over to you to tell us how to do things differently.
It's about evolving towards a new approach to learning.
The new approach to learning is one that moves us away from the process of continuously fixing and responding. When you're constantly going to a war room or responding to an active incident, those are not good opportunities to learn. People are worried about being blamed, named, and shamed. It's just not a good opportunity to learn. Chaos engineering and security chaos engineering are proactive exercises, where there is no incident and no war room going on. We're able to learn with eyes wide open.
So this begs the question: how does a system become stable? Early on in my journey with security chaos engineering, I ran into one of the world's largest payment processing companies. An engineer there was describing a situation where they had their legacy flagship application, the application that ran all the payment transactions for this large payment processing company. The engineers were confident and competent, and they understood how the system operated. It rarely had an incident or an outage. They were moving to Kubernetes, and there was concern about this transition and the lack of stability for Kubernetes. But was the legacy system always that stable, and how did the engineers become confident? Through a series of incidents and outages. They learned about the difference between how they believed the system worked and reality through a series of unfortunate events. Unfortunately, learning through that process incurs customer pain. Chaos engineering is about discovering the difference between how we believe our system works and how it functions in reality, before it manifests as customer pain.
As described before, people are in a different kind of headspace when an outage occurs, especially a security outage or incident. People are worried about being blamed, named, and shamed. They're worried about losing their jobs. To be honest, it's not about what happened; it's "get that thing back up and running, we're losing money," and that's not a good learning environment.
So chaos engineering fits within the world of instrumentation. It's somewhat of a loose definition, but folks in the space like to break instrumentation into testing and experimentation. Testing is the verification or validation of something we know to be true or false. In our world, that's a CVE or an attack pattern, something we already kind of know and are looking for. Whereas with experimentation, we're trying to derive new information that we didn't have before, about the unknown unknowns of the system. We do that in the form of a hypothesis: "I believe that if X occurs on my system, Y is the result." We never do a chaos experiment we know is going to fail. We only do one when we think the hypothesis is going to be true. Because if you know it's going to fail, just fix it. You're not going to learn anything new.
So, chaos engineering. My definition of chaos engineering is that it's the idea of proactively introducing turbulent conditions into a distributed system to try to determine the conditions by which the system will fail, before it actually fails. It's a proactive exercise. Security chaos engineering, news flash, is not a whole lot different, really. The security bits are just more in tune with the use cases and the value in terms of the engineering aspects of
security that we're attempting to instrument.
So what are the goals we're trying to achieve? We're trying to understand where the low-hanging fruit is that allows malicious activity to be successful to begin with, before an adversary has a chance to take advantage of it. And the reason why this is so big and so important is that a lot of the malicious code and malicious activity out there would never be successful if it weren't for the accidents and mistakes we make as a normal byproduct of building things. Accidents and mistakes happen, especially with the size, scale, speed, and complexity of modern microservice architectures running in a public cloud. We've never dealt with the speed and complexity post-deployment that we have now, and the opportunity for accidents and mistakes increases. What we're trying to do is proactively inject those failures into the system to determine whether or not we are prepared and can detect those things before an adversary can actually take advantage of them.
So through this process, the goal we're trying to achieve is to reduce the amount of uncertainty and assumption we inherently have about the system by building confidence through instrumentation and data. We don't believe in luck. It either works or it doesn't. We believe in instrumentation and data, and we're trying to instrument and build empirical data that shows us whether or not our system works as it's supposed to, and that helps us build confidence.
Here are some security chaos engineering use cases. There are more use cases documented in the security chaos engineering O'Reilly report, as well as in the chaos engineering book, but a few use cases you can get started with are incident response and security control validation, which is a great one. Also, I really like security observability. It's a great way to understand how well you can understand what's happening inside the system, especially the system's security, through the kind of output you get from the logs and events. And because we're proactive, we're not worried about an outage and trying to chase that down. We're able to calmly determine: hey, this log made no sense, or this control is not sending the right information for me to make a sound decision on what to do here. And lastly, compliance. Compliance is a great one. Every chaos engineering experiment, whether it's security or availability based, has compliance value. You're proving whether the technology worked the way you had it documented and the way you thought it did. All of that has auditable value.
There are a handful of companies that are implementing security chaos engineering today, and we are already seeing that number increase, which is fantastic and just tremendously exciting stuff. Cardinal Health, where I work, was one of
the first to really begin implementing that discipline, and we started in the summer of last year. Of the use cases that Aaron just described, the one that really drove our adoption was security control validation. At the end of the day, we, like so many other companies, put our faith in the idea that the tools, the technical designs, and the technology standards that we have keep us secure. And it's not that we don't trust our people, but look at the data. In the infamous 2020 Cost of a Data Breach Report, a lot of people regard the key takeaway as being that 52% of data breaches are caused by malicious actors. And really, these numbers have not changed in the last 10 years. But while that's what a lot of people take away from this, what I take away is that there's actually still 48% of data breaches that are caused by mistakes and accidents. To me, these are preventable failures, and they're the ones that my team targets.
Our leadership at Cardinal Health realized that we needed a different approach to security, one that focused not only on building security controls, but also on validating that the ones that we had stayed in place and didn't degrade over time. And really, we wanted a way to get after that 48%, which we believed was preventable. So last summer was when we really invested in a team to fix this and to start something new.
When we first started, I came from a career in software development, but we started to look at what other key skill sets we wanted to have on this team. And really, the fact that we were building a multidisciplinary team is something we viewed as key to our success. On my team, I have somebody who knows a lot about network security. I have somebody who knows architecture. I have somebody who's a former systems engineer themselves, and then somebody from the risk and privacy space. As I so delicately put it, one of the things that we want is people who know where some of the skeletons in the closet are buried, so that we can make good hypotheses on where to look first and how to address some of these concerns.
Next, we build a lot of our own validations using our own custom scripts and the APIs from our platforms. But other times, we do take a look and see what else is out there, whether that's open source solutions that already exist or commercial technology. But ultimately, at the end of the day, our goal is to identify these unknown technical security gaps and partner with the organization to remediate them before a bad guy finds them for us.
So how exactly do we do that? As excited as we were to just go nuts, get access to every system and every project, and look for technical security gaps, we knew that we really needed a disciplined and repeatable process to do
this, and we identified three key goals. First, our process had to identify indisputable, critical security gaps, so that when the organization considered the risk, we agreed it was worth fixing. Otherwise, you don't need somebody just pointing out a whole bunch of things but then ultimately doing nothing with the data. So we knew we had to establish benchmarks, ones that were relevant to our company and relevant to our systems. They couldn't just be theoretical best practices that people considered too lofty to achieve. They needed to be things that the organization agreed we had to fix.
Second, our process had to be big enough to see the big picture of a technical security gap. Whereas our frontline engineers had either seen evidence of gaps in the past or had hypotheses, what they lacked was the detail to be able to describe where the gap was and what the risk to the organization was, and really drive it to remediation. And so we wanted to be the answer to that challenge.
And then finally, we needed to make sure that any gap that we identified and drove to completion wasn't just unknowingly reintroduced in the future.
So with these goals in mind, we created a process called continuous verification and validation. Simply put, we wanted to, on a regular basis, continuously verify that our controls were where we believed they were supposed to be and validate that they were implemented correctly. And this process has five main steps.
First, obviously, you need to understand which control you're validating, and then you need to understand what benchmarks we're going to assess this control by. Largely, to give authority to the benchmarks that we're using, we like to use the patterns and the standards that are set forth by our security architecture team and approved by our CISO. But if those don't exist because it's a brand new area that we're looking at, that's where we will start to make some of our own recommendations, socialize those recommendations with the relevant teams, and get that buy-in, so that, like I was saying before, if we identify something, we agree that it has to be fixed.
Next is where we get to do the fun part, and this is really where security chaos engineering comes in. That's where we build the automation to validate those standards. Again, we could be writing our own scripts or we could be using other technologies, but we need to learn what our systems actually look like.
Next, we start to create dashboards to show the real picture of that technical security gap, and this does two things. One, it gives us real-time visibility into how we're doing with our security posture. Two, those dashboards are really good when we're talking to our leaders and when we're talking to partner teams, to help drive these remediations.
And then finally, if we start to see that our adherence to those benchmarks decreases, that's where we create an issue in our risk register, and we have a governance process to drive that through remediation. So that right there is the Cardinal Health story. Aaron, I'm going to turn it over to you to talk about your work with ChaoSlingr.
Thanks, Jamie. So about four and a half years ago, I started this journey at UnitedHealth Group. I was the chief security architect for the company, and we released an open source tool called ChaoSlingr, and I'm
going to talk about the primary example from ChaoSlingr. We built it as a methodology for verifying and validating the security we were building in the cloud. At the time, we were undergoing an AWS transformation, and we wanted to determine that all these decisions we were making were good decisions and that they were functional and effective as well. And so, in the process of open-sourcing it, we needed a good example that the rest of the world could understand of what to expect from an experiment.
The example that we open-sourced was something we called PortSlingr, which was the injection of a misconfigured or unauthorized port change. For some odd reason, it happens all the time in the cloud today. Whatever the scenario, it was a good example, because whether you're a software engineer, a network engineer, or a systems engineer, you have at least a basic understanding of what a firewall is supposed to do. The change could be applied incorrectly, somebody could have made the change out of band, or somebody could have filled out a ticket incorrectly. Somebody could have misunderstood network flow, for example. There are lots of reasons why that happens.
Anyway, our assumption was that we'd been solving for this kind of problem for 20 years, right? Twenty-plus years. So our assumption was that our firewalls would immediately detect and block this kind of issue and it would be a non-issue. So we ran this experiment, injecting the change into our AWS EC2 security groups, and we started finding that was not always the case. We found it was only about 60% of the time that our firewalls caught and blocked it. The issue was a drift issue between our commercial and non-commercial environments. This was proactive, remember, so we were able to fix that, and it was a non-issue.
The second thing we learned was that the cloud-native configuration management tool that we were using caught it and blocked the change every time. So something we were barely paying for was catching and blocking it every time.
The third thing we learned was that, at the time, we didn't really use a SIEM. We used our own homegrown security logging solution. I had little faith that an alert would actually be generated from the log events sent by the configuration management tool and the firewall, but it correlated them into an alert. So that was great. It gave us confidence that our homegrown solution was actually going to work. That's the third thing.
The fourth thing we learned was that when the alert went to the security operations center, the analysts did not know what to do. They couldn't tell which AWS account structure it came from, because we had a lot of accounts, both commercial and non-commercial. You could use the IP address to figure out where it came from, but that could take 15 minutes, 30 minutes, or three hours. If SNAT is in play, it's going to take a lot longer. The point here is that had that actually been an outage or an incident, it would have cost millions of dollars. But there was no outage. There was no incident. All we had to do was add metadata to that event, and we had fixed the problem.
So that's kind of it. That's the story of how we were able to test and instrument each part of the chain. And then the idea is that once you're able to prove your chaos experiment successful, it becomes more of a regression test. You run it periodically over time.
Thanks, Aaron, for that. So if you're looking to get started with security chaos engineering and you need more of a foundation before you implement, the great news is that there is now an official O'Reilly report on the topic
and it's free. It was written by both Aaron and Kelly Shortridge, who is the VP of product strategy at Capsule8, and it contains a whole bunch of stories from people who are doing this in the real world. So for example, if you've enjoyed the conversation about Cardinal Health and you want to learn more, I do go into much more detail in that report. And we have people from across the globe who have done this at their companies as well, including Verica, where Aaron works, Google, Capital One, and others. I really can't say enough good things about this book, because it does contain so many real-world examples. So please, please, please do check that out.
Next, if you're thinking that security chaos engineering is just too cutting-edge for your organization, this is where I really encourage you to adjust your mindset and think of this as really just like standard testing. As I said in the beginning, my background is actually in software development, and what really attracted me to security chaos engineering was that the parallels between this and software testing were just too profound to ignore. It's just that instead of testing that your system meets the functional requirements that were set forth, you're testing that your system is meeting not only the security requirements but the resilience requirements as well. And the use cases are really super similar. You can run this in a non-prod environment that's awaiting promotion to production, to make sure that you haven't done any unknown damage.
You can even look at this the way my team wants to go, which is taking a look at this like it's test-driven development. Where we really want to be is partnering with our security architecture team and others so that, as our company is developing new patterns and new standards, we're writing our tests up front. And as those new controls are built, we see our tests start to pass and know that we are deploying a new security control in the way that we think is best for our company. So when you think about it that way, security chaos engineering really stops being so esoteric and really becomes logical.
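To make the test-first idea concrete, here is a minimal Python sketch, assuming a hypothetical standard that every storage volume must be encrypted. The inventory format, the `encrypted` field, and the function name are illustrative assumptions, not anything described in the talk. The check is written before the control exists, reports a gap against the current inventory, and comes back clean once the control is deployed.

```python
from typing import Dict, List


def unencrypted_resources(inventory: List[Dict]) -> List[str]:
    """Return ids of resources that break the hypothetical
    'all storage must be encrypted' standard."""
    return [r["id"] for r in inventory if not r.get("encrypted", False)]


# Hypothetical inventory before the new control is rolled out.
before_rollout = [
    {"id": "vol-a", "encrypted": True},
    {"id": "vol-b"},  # a missing flag is treated as not encrypted
]

# The same hypothetical inventory after the control is deployed.
after_rollout = [
    {"id": "vol-a", "encrypted": True},
    {"id": "vol-b", "encrypted": True},
]

print(unencrypted_resources(before_rollout))  # check written up front reports a gap
print(unencrypted_resources(after_rollout))   # check passes once the control exists
```

In practice the inventory would come from a platform API rather than a literal list; that part is left out here because the talk does not name one.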
As I see it, just as the world of software engineering first adopted testing methodologies, the rest of the systems engineering world is going to do the same. And the good news is that once you've gotten past that mindset barrier and you're ready to experiment, it's possible to start very, very small. I think one of the myths of security chaos engineering is that before you can do anything, you have to have this massive, system-wide experiment in production that takes multiple weeks to build and requires VP approval and everything, but that doesn't have to be the case.
If you take a look at the example that Aaron just shared, what I see there is the opportunity for multiple tests. So instead of doing all of this at once and recreating PortSlingr in your environment, you can start with something like testing just the fact that you receive that log message when something was changed. You can even do that manually. You can do something in a non-production environment, or safely in a production one, and verify that you get a log message.
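The smallest version of that check could be sketched in Python as below. This is only an illustration under stated assumptions: `fetch_events` stands in for whatever call returns recent log events in your environment, and the event fields are made up for the example. The function polls until a matching event appears or a timeout expires.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_log(fetch_events: Callable[[], Iterable[dict]],
                 matches: Callable[[dict], bool],
                 timeout_s: float = 30.0,
                 poll_s: float = 1.0) -> Optional[dict]:
    """Poll a (hypothetical) log source until an event matches,
    or return None once the timeout has passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        for event in fetch_events():
            if matches(event):
                return event
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_s)


# Stand-in log source: the event only shows up on the second poll.
_polls = {"count": 0}


def fake_fetch() -> list:
    _polls["count"] += 1
    if _polls["count"] < 2:
        return []
    return [{"action": "port_opened", "port": 2222}]


found = wait_for_log(fake_fetch,
                     lambda e: e.get("action") == "port_opened",
                     timeout_s=5, poll_s=0)
print("log message received" if found else "no log message: gap found")
```

A `None` result is the interesting outcome: the change happened, but nothing was logged within the window you expected.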
Similarly, if a log message is recorded, you could test just the incident response portion. And what this allows you to do is get a couple of examples and then organically grow the business case to do more concerted efforts with security chaos engineering and security control validation.
So if you're wondering where to begin, you may have a few high-value, low-effort targets in mind, and that's awesome. You can start there. Like I said, I guarantee you'll be able to prove your value pretty quickly and organically grow that case for sustained investment. But if you're like me, maybe you have a lot of discrete, high-value testing possibilities and it's hard to make sense of them all. In this case, to me, it's okay for you to be what I call the right kind of lazy and be a little selfish. So start by looking at the health of your team. Do you have engineers who log into your systems every day just to make sure that services are up and running? Do they do this on the weekends or in the off hours? When you are that right kind of lazy, you put the engineering effort in up front to build some simple validations and some simple tests that will push alerts to you when something fails. And to me, this is the beauty of security chaos engineering, because it allows you to start building confidence that your security is working even when you aren't.
And so there you have it. As you can see, security chaos engineering is a field that seems so esoteric and out there, but really, at its core, it's just incredibly simple. Instead of allowing the complexity of our systems to overwhelm us and cause so many opportunities for mistakes and accidents, the solution is to really flip that model on its head. Instead of relying on our outdated documentation and our preconceived notions, and instead of relying on our production incidents to teach us about our systems, we can proactively go in, eyes wide open, ready to experiment in a time of calm, and actually learn about our systems. And when we do this, we actually get the opportunity, and I do see it as an opportunity, to really remember our systems for what they are and not what we hope for them to be. And it's through testing and experimentation that we get the opportunity to identify ways to proactively secure and stabilize our systems and get ourselves out of the reactionary, fire-drill engineering practices of today.
And there you have it. We hope that this session was valuable and that you learned something. Again, major, major thanks to the organizers at RSA for putting this together and for highlighting us and our work in both resilience and security chaos engineering. We were incredibly excited to be a part of the 365 virtual series.
A few things to keep in mind: if you are looking to learn more, there is a link at the bottom of this slide that's not only going to get you a free copy of the security chaos engineering report, but is also going to get you a copy of the official O'Reilly book on chaos engineering as well. Both books are incredibly awesome. Make sure you check those out. And again, they are absolutely free.
Next, we are going to be continuing this conversation in a live Q&A session, if you're joining us here today. We are so excited to be able to go a little bit more behind the scenes, talk about our stories, and answer whatever questions you have. And of course, we've provided our contact information if you'd like to continue the conversation there. Thank you for everything.
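As a recap of the PortSlingr example from the talk, the hypothesis-plus-observations shape of an experiment could be captured roughly as follows. This is a sketch under assumptions: the class, method, and field names are hypothetical and are not taken from ChaoSlingr itself. The recorded observations paraphrase the four findings described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """A hypothesis plus the observations that support or refute it."""
    hypothesis: str
    observations: Dict[str, bool] = field(default_factory=dict)

    def record(self, expectation: str, held: bool) -> None:
        self.observations[expectation] = held

    def gaps(self) -> List[str]:
        """Expectations that did not hold, in the order recorded."""
        return [name for name, held in self.observations.items() if not held]


def portslingr_recap() -> Experiment:
    exp = Experiment(
        hypothesis="If an unauthorized port change is injected, "
                   "the firewall detects and blocks it."
    )
    # Paraphrased from the talk: the firewall held only about 60% of the time.
    exp.record("firewall detected and blocked the change every time", False)
    exp.record("configuration management caught and blocked the change", True)
    exp.record("logging solution generated an alert", True)
    exp.record("analysts could identify the source account", False)
    return exp


for gap in portslingr_recap().gaps():
    print("gap:", gap)
```

Each gap is a candidate fix, and once it is remediated the same experiment can be rerun periodically as a regression check.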