Doug Campbell
Engineering Manager, SRE at Grubhub
Chaos Conf 2020, October 8, 2020, Online, USA
Self-Service Chaos Engineering: Fitting Gremlin into Grubhub's DevOps Culture - Doug Campbell

About the talk

In the era of DevOps and self-service culture, human processes are often harder than technical ones. Rolling out Gremlin to our infrastructure was easy, but enabling engineering teams to efficiently and safely practice Chaos Engineering was trickier. In this session, I'll share how we rolled out Gremlin at Grubhub and how we educated and enabled all engineering teams to use it.

About speaker

Doug Campbell
Engineering Manager, SRE at Grubhub

Transcript

I'm Doug Campbell, a site reliability engineer at Grubhub, and my talk today is about self-service chaos engineering and how we fit Gremlin, or really any chaos engineering toolkit, into a DevOps and SRE culture. Before I really get to my slides, I'll do a little PR for Grubhub and give some context on why reliability matters to us. For anyone that doesn't know, Grubhub is a food ordering and delivery platform with the goal of connecting diners to local restaurants. As of August, when I grabbed these stats, we had over 20 million active diners and 300,000 restaurants on our platform, and roughly 650,000 orders a day. An outage or problem can easily ruin someone's day: from the drivers who can't get any orders to make money, to the restaurants that have to deal with angry customers if an order is wrong, to the parents with grumpy kids waiting on food that shows up an hour and a half late. I'm sure almost everyone here who orders food often has experienced some issue on whatever platform they use and knows how frustrating it can be, amplified by the fact that you're hungry and angry. So engineering at Grubhub as a whole, not just SRE, is relentlessly focused on minimizing what can go wrong and building fault-tolerant distributed systems to make sure diners, drivers, and restaurants have a positive experience with Grubhub.

I've been at Grubhub for about two and a half years now, out of Chicago. First I worked with our delivery logistics teams, helping make sure everything from order routing to driver payments went smoothly. Now I'm on the SRE team that owns our CI/CD tools and workflows, which includes things like chaos engineering and how we can leverage chaos to make our systems more reliable.

As the title of my talk implies, the focus is self-service. Before I get to the details, I want to explain what I mean by self-service engineering. In a lot of ways I just mean automation, but specifically automation that is made to be used by all engineers, not just operations or SRE. It's basically what we hear about as a DevOps culture, but that's kind of a loaded term, so I like to avoid it. It's not just automating infrastructure creation; it's making those tools usable and approachable to the larger engineering organization. I like the analogy of a vending machine: I want any engineer to be able to go to the vending machine, press a button, and get an S3 bucket or a deployment or a chaos attack, and get what they need without ever having to approach an SRE or some special team. I've worked at a few places now that really focus on this self-service culture, and it frees operations and SRE to work on more impactful projects, like chaos engineering. It helps reduce those knowledge pockets or knowledge silos, since no single team has to do all the work or hold all the knowledge, and it helps create a culture of trust, where engineers are empowered to do what they need, we in SRE trust them to make smart decisions, and the devs trust that there are appropriate guardrails in place.

Our desire to maintain this culture of engineers being able to self-serve really drove how we implemented chaos engineering. So why did we want chaos engineering in the first place? I talked about why reliability is important to us, but why did we think chaos would help?

We had a mature product and what we thought was good observability: all of our services had SLOs defined, we had lots of custom application metrics, and developers are on call and support their own applications. We felt pretty confident in our ability to alert on and respond to problems. What we didn't have was reproducibility of problems, things like infrastructure-level fault injection to replicate issues, or any way to experiment at all. So we didn't feel confident in our ability to reproduce or find new problems. We did have some hacky ways of introducing failures, but they usually required an SRE to manually mess with individual hosts, adding iptables rules, faking load, or manually killing nodes. It was unstructured, hard to use and reproduce, and didn't scale at all, and only a few people really had the access to even try it. It was something we really only did after emergencies, not before, and it gave us no way to experiment with failure.

We felt that with some level of infrastructure fault injection, we could validate our SLOs before we had problems, and we wanted a platform for experimentation. That's what sold us on chaos engineering and the idea that, through breaking things, we can learn a lot about how our systems work and use that knowledge to make them better. Our infrastructure, like most organizations', is constantly growing in complexity, and with that our list of assumptions about how a service behaves grows too. Here are questions that we as engineers at Grubhub found ourselves asking often: Does load shedding work the way I hope it does, and has it actually stopped a heavy operation before it caused downstream problems? Does my custom metric trip during a weird scenario? Does a message get processed again if the node handling it goes down mid-processing? And what are the consequences of losing an availability zone, or even an entire region?
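
The load-shedding question in that list is the kind of mechanism these experiments exist to validate. As a generic illustration of the pattern (not Grubhub's implementation), a minimal concurrency-based shedder might look like the sketch below; a CPU or latency attack run under load would then confirm that heavy operations really do get rejected before they cause downstream problems.

```python
import threading

class LoadShedder:
    """Reject work once in-flight requests exceed a limit (generic sketch)."""

    def __init__(self, max_in_flight: int = 100):
        self._semaphore = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: if we're at capacity, shed the request instead of queueing it.
        return self._semaphore.acquire(blocking=False)

    def release(self) -> None:
        self._semaphore.release()


shedder = LoadShedder(max_in_flight=50)

def handle_request(do_work):
    if not shedder.try_acquire():
        # Fail fast with a lightweight error rather than piling up expensive work.
        return {"status": 503, "body": "shed: server overloaded"}
    try:
        return {"status": 200, "body": do_work()}
    finally:
        shedder.release()
```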

Like I said, we had some ways of testing these problems, but no easy way we could put in the hands of our developers to give them room to experiment. Chaos engineering could help us answer all of these questions and uncover the ones we hadn't even listed. And we really believe that, ultimately, developers know their services best, so we knew that whatever tool we used, we needed to put it in the hands of our developers to get the best results. We also knew that any tool we introduced would need to integrate with existing workflows and fit the self-service culture, because we really wanted to give each engineering team the power to run its own chaos experiments. Just throwing a new tool at developers makes it one more thing they need to learn, but if we could fit it into our existing workflows, we believed we could really increase adoption. Gremlin is the tool we ended up picking, due to its ease of use, its compatibility with our infrastructure, and its APIs, which let us fit it into our existing workflows.

First, we added Gremlin to all of our services in pre-prod. We didn't want to do a slow rollout; we learned in the POC that it was pretty straightforward to roll out and that we'd get the most value by just putting it everywhere. Breaking pre-prod can cause pain to other engineering teams, but not really any financial loss to the company, so we were okay with the risk of just putting it everywhere. This really fits with our culture of trust: we trusted engineers to communicate and use the tool responsibly, and we went from zero to two thousand clients in about a week. We also set up authentication and gave all engineers access, since we just wanted this to be in everyone's hands. It's all managed under one team in Gremlin, which means that all engineers can run attacks on all services, and that's exactly what we wanted. We didn't want really tight permissions or to be overly cautious, and we didn't want the pain of maintaining multiple teams with different access levels for pre-production. We felt we'd get the best results by making it open and accessible, encouraging everyone to experiment, accepting the risks that come with that, and trusting our engineers to make smart decisions and to communicate with other teams if they're going to be breaking something.

To integrate this into our existing workflows, we created a custom Spinnaker pipeline for triggering chaos attacks. Spinnaker is the tool our engineers already use for deployments, rollbacks, and ramping up their services, and as I mentioned earlier, we felt that by integrating the chaos tooling into existing workflows we would really improve adoption. Something we discovered early on is that, since we're running this tool in pre-prod, most attacks need a load test running in order to simulate prod traffic; an attack is way more useful when you see the results under normal load, with real traffic, or as close to real as we can get. So the pipeline gives developers an easy way to trigger both an attack and a load test at the same time, using the same UI they're already familiar with and that they use for managing deployments and other infrastructure tasks. This is the vending machine from my earlier analogy: they just press a button and they get chaos.
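
The talk doesn't show the pipeline internals, but conceptually the custom stage only needs to call Gremlin's REST API to start an attack while the load test runs alongside it. Below is a minimal sketch of such a call, assuming Gremlin's attacks endpoint and a team-scoped API key; the service tag, attack arguments, and request schema details are assumptions to check against Gremlin's current API docs, not a copy of Grubhub's pipeline.

```python
import os
import requests

GREMLIN_API = "https://api.gremlin.com/v1"
API_KEY = os.environ["GREMLIN_API_KEY"]  # team-scoped key, assumed to be injected by the pipeline

def start_latency_attack(service_tag: str, latency_ms: int = 300, length_s: int = 300) -> str:
    """Kick off a latency attack against hosts tagged with the target service.

    The body below follows Gremlin's attack schema as best understood; verify the
    exact field names against the current API documentation before relying on it.
    """
    body = {
        "target": {"type": "Random", "tags": {"service": service_tag}},
        "command": {"type": "latency", "args": ["-l", str(length_s), "-m", str(latency_ms)]},
    }
    resp = requests.post(
        f"{GREMLIN_API}/attacks/new",
        json=body,
        headers={"Authorization": f"Key {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text  # attack id, useful for halting or inspecting the attack later

if __name__ == "__main__":
    attack_id = start_latency_attack("menu-service")  # hypothetical service tag
    print(f"Started attack {attack_id}; the load test runs in parallel via the same pipeline.")
```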

So, like most things, rolling out Gremlin on a technical level wasn't really the hard part; the hard part was getting users to use it. What are the ways to evangelize it, and how do you educate engineers to use it properly? The first thing: I really like "feed the hungry", meaning work with the people who are most interested in the tool first. Drive early adoption, get it in the hands of people who actually want to use it and don't need much convincing. These tend to be your power users; you're going to get good feedback, and they'll promote it to the teams that use it right away. You have to promote the tool often as well, at every post-mortem, every design meeting, et cetera. You want to be the annoying chaos engineering salesperson in the conversation, looking for opportunities to promote it as a helpful tool. Documenting use cases was a good idea for us: it gives people ideas for how to use the tool, and mixing your use cases in with the technical documentation helps plant those seeds. Identify a few big use cases at your organization and put them in the internal documentation.

Hold lots of different learning sessions, demo the tool, and talk about chaos engineering philosophy and technique. Again, you're trying to be a chaos engineering salesperson: sell chaos. If you find a question being asked more than once, document it. The idea is that you want to build up the knowledge base throughout the organization so that everyone's on the same page, not just one special chaos engineering team. Ideally you get to the point where developers get most of the help they need from their own teammates and from the documentation, and they just use the tool without ever having to engage an SRE. While doing this, keep emphasizing experimentation; it's a really important part of chaos engineering. Follow the scientific method: record results, adjust the experiment, and repeat. And lastly, I offered my time to help design attacks and work through use cases. Self-service as an SRE doesn't mean ignoring development teams; I'm there to help and promote good practices. But I would push back on actually running the attacks for them, because to learn the tool they need to actually use it.

During this whole process, keep in mind that the goal is to empower your engineers. You want them to feel confident using the tool and causing chaos.

So has self-service chaos engineering actually helped us? We're still fairly early in our chaos engineering journey, and I don't have good data yet on whether it has helped us achieve our SLOs or reduced reported problems. I would like to get that data, as it could tell us a lot. But here are some good examples of problems found before going to production thanks to Gremlin and chaos engineering practices. Mostly these issues weren't really what we were looking for at the time, but they were uncovered in the process of experimenting with the tools. The first issue that I remember finding was when I was just messing around with some features and set the system clock back on a few hosts. A little while later I saw some chatter about people being confused, related to the hosts I had targeted. It turns out that when I set the system clock back, some new certificates were no longer considered valid. We would have run into this in production with daylight savings time, but we were able to patch it long before it became a real issue, thanks to just experimenting with ad hoc failure scenarios.
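
A check like the sketch below is the sort of thing that clock-skew experiment ends up validating: whether the current time still falls comfortably inside the validity window of the certificates a host serves. It is a generic illustration rather than the actual patch described in the talk, and the hostname is a placeholder.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

def check_cert_validity(host: str, port: int = 443, skew: timedelta = timedelta(hours=12)) -> None:
    """Warn if the served certificate would break under a modest clock skew."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()

    not_before = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notBefore"]), tz=timezone.utc)
    not_after = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    now = datetime.now(timezone.utc)

    # A brand-new certificate whose notBefore is "just now" fails as soon as a clock drifts backwards.
    if now - skew < not_before:
        print(f"{host}: cert not yet valid if the clock runs {skew} slow (notBefore={not_before})")
    if now + skew > not_after:
        print(f"{host}: cert expires within {skew} (notAfter={not_after})")

check_cert_validity("api.example.com")  # placeholder host
```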

Another issue we encountered was some weird timeouts in a dependency. One team kept hitting these timeouts but was unable to figure out where or why, and was having a hard time replicating them. By adding latency-type chaos attacks, we could target the impacted systems. The engineering team was able to track down where the timeouts were coming from, implement better backoff and handling in the code, and then validate that it all worked by running those same chaos tests again.
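
The talk doesn't show the code change, but "better backoff" for a flaky dependency usually means capped exponential backoff with jitter wrapped around a per-call timeout. Below is a minimal sketch of that pattern; the names are hypothetical and it illustrates the technique, not Grubhub's code.

```python
import random
import time

def call_with_backoff(call, attempts: int = 5, base_delay: float = 0.1, max_delay: float = 2.0):
    """Retry a flaky dependency call with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()  # the call itself should enforce a per-request timeout
        except TimeoutError:  # or whatever timeout exception the client raises
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff, capped, with full jitter to avoid synchronized retries.
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)

# Example: wrap a dependency call that times out under a latency attack (hypothetical client).
# result = call_with_backoff(lambda: dependency_client.get_menu(restaurant_id, timeout=0.5))
```

A latency attack against the dependency is then a repeatable way to confirm the retries behave as intended.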

Another issue an engineering team uncovered, which I thought was a really interesting problem, was how poorly Elasticsearch recovered from a network outage. We were seeing real problems in production where network flapping would cause significant load on Elasticsearch. By using the chaos tools to black-hole traffic to the Elasticsearch hosts, we were able to see that, based on how requests queued up, when the network problem ended the queued requests would flood Elasticsearch and cause the load problems. This wasn't really the issue we were expecting to see, and at first everyone was confused about why we only saw the problems after the issue had been resolved, after the chaos attack ended. But with the chaos tooling we were able to really iterate on this problem, experiment, get a full picture of what was happening, fix it, and then use the tools again to validate that the fix worked.
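
The failure mode described here, a backlog of queued requests flooding Elasticsearch the moment connectivity returns, is commonly addressed by bounding the queue and dropping work that has already outlived its caller's timeout. The sketch below is a generic illustration of that idea, not Grubhub's actual fix.

```python
import collections
import time

class BoundedRequestQueue:
    """Hold requests during an outage, but cap depth and drop stale entries on recovery."""

    def __init__(self, max_depth: int = 1000, max_age_s: float = 5.0):
        self._queue = collections.deque(maxlen=max_depth)  # oldest entries fall off when full
        self._max_age_s = max_age_s

    def enqueue(self, request) -> None:
        self._queue.append((time.monotonic(), request))

    def drain(self):
        """Yield only requests still young enough to be worth replaying after recovery."""
        now = time.monotonic()
        while self._queue:
            queued_at, request = self._queue.popleft()
            if now - queued_at <= self._max_age_s:
                yield request
            # else: silently drop; the client has long since timed out anyway
```

Replaying only a small, fresh backlog keeps the thundering herd off the cluster when the black-hole attack, or a real outage, ends.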

Lastly, it gives us increased confidence in our platform. This is much harder to quantify, but a lot of teams just use the tool to validate their own custom fallbacks and failover procedures; they're making sure everything works as expected, which leads to increased confidence. That increased confidence makes development teams more comfortable supporting their own services, and it helps build the knowledge base and the trust that come with the self-service culture we're really striving for.

So now we're fully rolled out in pre-prod and developers are enabled and encouraged to experiment. What comes next for us? First, we want to continue to drive adoption: while everyone is able to use it, that doesn't mean everyone is. This is where I continue to be the chaos engineering salesperson, evangelize the tools, drive adoption, and figure out what roadblocks are keeping developers from using the tool right now. Next, we want to get this more tightly integrated into our continuous integration and deployment tests. If a team can automatically trigger attacks in their tests, they can leverage them to help validate that releases are safe, and run infrastructure-level tests alongside their code tests before deployment without really any additional mental load or friction. I also envision putting Gremlin on production hosts and causing some real production chaos. We've learned that we get the best results when traffic is closest to prod, so we'd increase the value a lot by running against real production traffic. Of course, there would need to be more roadblocks and guardrails than we currently have in pre-prod, but we'd really aim for that self-service implementation, putting the power in engineers' hands. We would probably break away from our one-team model here so that we can section off permissions to individual service owners.

We also want to start running game days and large-scale attacks. Since individual teams run their chaos attacks right now, they tend to be isolated to specific services; running game days every once in a while would target the whole infrastructure, which would unlock another level of value from all of this and give us a better understanding of our infrastructure as a whole. And lastly, we want to start implementing more random chaos, things like setting up an attack to kill nodes throughout the day, like the classic chaos monkey, or other things to mimic real-world infrastructure failures that aren't always planned. Planned experimentation is great, but there's a whole other level of value from unplanned, real chaos.
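
A chaos-monkey-style setup can be as small as a scheduler that wakes at random intervals during working hours and terminates one instance. The sketch below illustrates the idea; the kill_random_instance hook, host names, and schedule are all hypothetical stand-ins for whatever tooling (for example a Gremlin shutdown attack or a cloud API call) would actually do the termination.

```python
import datetime
import random
import time

def kill_random_instance(candidates):
    """Placeholder hook: terminate one host via your chaos tool or cloud API."""
    victim = random.choice(candidates)
    print(f"would terminate {victim}")

def run_chaos_monkey(candidates, start_hour=9, end_hour=17, mean_gap_minutes=90):
    """Kill a random node at random times, but only during weekday working hours."""
    while True:
        # Exponentially distributed gaps approximate "unplanned" failures.
        time.sleep(random.expovariate(1.0 / (mean_gap_minutes * 60)))
        now = datetime.datetime.now()
        if start_hour <= now.hour < end_hour and now.weekday() < 5:
            kill_random_instance(candidates)

# run_chaos_monkey(["host-a", "host-b", "host-c"])  # hypothetical host list
```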

That's my talk about fitting chaos engineering into a self-service culture. Thank you all for listening.

One of the last points you made was about production traffic, and one of the questions that came in was about your pipeline: you mentioned having load tests to simulate traffic, so what tools do you use to generate that load?

Yeah, we use Gatling for running a lot of these load tests. Each service team, as part of just having a healthy service, is expected to write a load test for their services. We also deploy a simulator with every service that mocks out some endpoints for consumers to use, so they get a stable, healthy version of the service in pre-production. So it's an expectation for all of our developers to have load tests, and within the pipeline itself we're just kicking off a Jenkins job that then triggers the load test to run against the target.

Cool. And with those load tests, are you simulating just average traffic, or do you also simulate peaks, for planning for spikes?

That's a parameter they get to pass in. The default would just be normal load throughout the day to make sure everything works, but they can pass in different parameters for how many users they want to simulate or what level they're trying to get to. It's really up to the engineering team running it what level they want to run their load tests at.

Okay, cool, so that falls right in line with that whole self-service idea, right?

Yeah, yeah. We really just want them to be able to do it all and give them the amount of control they need to do it.
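
Jenkins exposes parameterized jobs over its REST API, so passing a traffic profile through from the pipeline is straightforward. Below is a minimal sketch of triggering such a job; the Jenkins URL, job name, and parameter names are hypothetical placeholders rather than Grubhub's actual setup.

```python
import os
import requests

JENKINS_URL = "https://jenkins.example.com"   # placeholder
JOB = "load-test-menu-service"                # hypothetical job name
AUTH = (os.environ["JENKINS_USER"], os.environ["JENKINS_API_TOKEN"])

def trigger_load_test(users: int = 200, ramp_seconds: int = 120, duration_minutes: int = 15):
    """Kick off the Gatling load test with a chosen traffic profile (normal load vs. spike)."""
    resp = requests.post(
        f"{JENKINS_URL}/job/{JOB}/buildWithParameters",
        params={"USERS": users, "RAMP_SECONDS": ramp_seconds, "DURATION_MINUTES": duration_minutes},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()  # Jenkins answers 201 and queues the build

# Default profile for normal load, or bump USERS to rehearse a peak:
# trigger_load_test()              # average traffic
# trigger_load_test(users=2000)    # spike rehearsal
```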

Some of the other questions are more about that self-service model. You mentioned that developers have access to everything, so the question is: are there really any boundaries around this, and how are engineers expected to conduct experiments with respect to other teams?

Sure, sure. Today they have access to everything from the chaos engineering toolkit perspective, but they don't have full access to all of pre-prod, and they don't have full root and admin on everything; that's where the guardrails that SRE builds come in. The expectation for them is just to communicate if they're going to be breaking something: understand what your service impacts, understand what an attack is going to impact, and know the right places to communicate if you're going to be doing something. It hasn't been a problem yet. People talk, they communicate in the right channels, and for the most part they tend to know what they own, they talk to the service owners, and they're responsible with the tools.

So I guess, similar to that: do they ever really test for things like noisy-neighbor issues? Because it sounds like all the teams are using shared infrastructure and that could be a problem. And if so, how do engineers account for that?

We're moving our services to Kubernetes, and I think noisy neighbors will become a much larger issue there; we're working out how that's going to look. The way our business verticals are organized, teams would just communicate within those specific verticals. I think that's the plan there, but we do see it becoming a larger issue as we move to Kubernetes and share hosts more.

Yeah, so I'm curious about that move to Kubernetes. How is chaos engineering a part of that transition? Are you using it in the migration?

We're setting it up right now, and it will be in place before we start migrating any real services. The goal is, first, to use attacks to harden the clusters themselves, running resource-type attacks that impact an entire cluster and its hosts. Then, with the individual teams, we plan on working out what the migration plan looks like so that they're running chaos tests as they migrate over, to understand the differences between how their services behave in the current single-tenant, one-service-per-host-type world versus the Kubernetes world.

Cool, awesome. Doug, it's been fantastic to chat. For everybody else, if you want to continue chatting with Doug, he'll be available in the Q&A Doug Campbell channel in Slack. So thanks again for your talk, it was fantastic. And as mentioned, I do want to give a shout-out to Grubhub, because they're feeding part of this conference; those of us who are running it and don't have time to cook for ourselves are definitely making use of your services. So thanks again for joining us.

Yeah, thank you so much.
