About the talk
How do you build a culture of reliability in a massive organization with well-established expectations of how to operate? A common assumption about enterprises is that everything moves at a glacial pace. After Charter’s product data engineering team grew from a handful of engineers to 30, the company implemented a large reorg. The new data platforms group quadrupled in size to over 120 engineers and took on responsibility for a mission-critical services platform that backs customer self-service digital applications and portals. This set of services needed to grow its reliability and Chaos Engineering practice. Nate Vogel, VP, Data Platforms, will share how he grew the data engineering team with an emphasis on building a culture of reliability. He’ll discuss the processes and tools his team used to ensure Charter and its customers have the data and analytics necessary to drive the business. Nate will also provide insight on how to share a culture of reliability in the face of sudden team expansion.
Pragmatic leader who ties a passion for technology with vision and management to drive the development of better software, resilient operations, and true data empowerment. Experienced in everything from building and leading engineering teams to architecting and developing robust code and scalable infrastructures. Avid promoter of SRE/DevOps concepts, workflows, and team structures. Comfortable owning complex, business-critical, highly available data pipeline and reporting systems. Confident development manager of distributed systems engineering teams. Successfully launched and operated in production a slew of web-scale data integration pipelines leveraging Hadoop, Kafka, Storm, Cassandra, Spark, Elasticsearch, Druid, and more.
Hello everyone. I'm excited to be the last presenter of Chaos Conf 2020, and I also feel the pressure. I can't promise I'll bring together all of the content from across the conference. This year's conference was fantastic, with amazing speakers, and the good news about my talk is that it emphasizes culture, so we cover much of the same ground. It's not exactly a formula, but I will speak to what has worked for me as I've grown a small organization into a fairly large one that has a strong culture that
believes in resiliency. So let's get started. First off, the enterprise. Charter Communications is the company I work for, and as you may or may not know, it's a very large company. There are some very nuanced and large things involved in working in an enterprise. When you're trying to develop software with an enterprise's resources, there's also enterprise responsibility, and that is inherently where the struggle between the speed of innovation and the stability of the core comes
in. As you're trying to develop software, you inherently have to pass through a number of systems. Greenfield is almost not a thing in the enterprise. You're almost always working with some semblance of legacy systems and other systems that must be incorporated, and that is one of the main challenges. Many of these systems are operated, in some cases, by different departments that are multidisciplinary, so there are many different types of personalities and cultures. And
ultimately, that means there is a conflict of risk management strategies as well: you have to keep delivering software while managing risk and operating the core. Any change, whether small or great, carries inherent risk, and that risk is often subjective to the individual engineers responsible for making the change. The other point is very obvious and often cliché: there's just lots of red tape. Approvals, justifications, review boards, and contingencies abound in the enterprise,
and in many cases can slow down innovation. So what often happens is that we find ways in the enterprise to work together by working apart. As I mentioned, there are multiple disciplines and different organizations, and they're all built around how to leverage their individual skills and philosophies. The interconnections are really where things get tricky. Releases are tricky too, when it comes to actually getting software across the line and into customers' hands. As I mentioned, really,
legacy is everywhere. In an enterprise, especially one that's been around for 10-plus years and has also undergone a lot of acquisitions, there is legacy debt from systems that your team almost certainly does not understand. And outside of your team, you've got other teams, so it's a multidisciplinary balancing act to contend with as you find the path forward that works. That often means long cycles, with a lot of risk management studies and strategies, and
trying to come to common ground where the minimum risk is achieved. Everyone wants to see CI/CD, but that is super hard in an enterprise. That legacy debt and all that red tape can be very challenging to corral in a constructive way. So one way to do that is to just chip away at it slowly: automate everything, and build great relationships with the teams that rely on your software or that you rely on. One of the things that has really emerged in my career, and at Charter specifically, is that my team started small. I was able
to bring a lot of my past experience running operations, DevOps, and SRE teams to Charter, and I was able to get started and get my roots planted. Then, within the last year, we underwent a reorg. I look at a reorg as the opportunity for the executive suite to instill the direction they would like to see the organization go. So they make the change with a top-down approach, and what it really does is ultimately test the culture. That
tested my team's culture through this merge, where ultimately we had exponential growth, from 30 to 120 people. Quick introductions: Charter Communications is a connectivity services company with 30 million customers across cable, internet, voice, and mobile, and 95,000 employees. And that is our footprint. I am the vice president of data platforms. My team encompasses many different strategies for accessing data in a consistent manner, a popular term for abstracting away
integrations, and I'll talk more about that in the coming slides. The data platforms team is the backbone of much of what enables the business to operate in a modern fashion. What I mean by that is that decisions and roadmaps are ultimately driven by data, and not only by the gut instincts or opinions of those in charge. They are receiving inputs and getting feedback through the many chains of analytics and the teams responsible for assembling them. And the data platforms team is the one that is unifying a lot of those disparate
threads and pulling things together across many different departments. Our core responsibility is to provide a consolidated view of that data to those who need it: often our executives and, even more so since the reorg, customer service. One important point is that integration is not necessarily just about systems. It's also about teams: you've got to deal with the teams that are either the source or the destination for that data conduit.
Unification: the team develops and operates for reliability. The best data platforms to actually use are ones that are reliable, predictable, and, frankly, boring. It's easy to get access, and it doesn't take a lot of work. It's easy to make requests for augmentation, to request new features, anything of that nature. So bringing a reliable platform with the ability to be flexible is the core of our team. And one way to do that, really the only way to be reliable, is to have excellent operating habits.
And the way we do that is through shared pain. I really believe that if you build it, you operate it, and you're going to feel that pain. That's where the legacy departmental separation, or divide, between operations and development makes it very challenging to instill good habits around getting feedback right back to the developers, so that up front, while you're writing the code, while you're in development, during the life cycle before it hits the customer, you've got resiliency top of mind.
A culture of reliability: I think that is the crux of how you actually enable a team to have that mindset of developing for reliability through operational excellence. And I think I've developed a formula, and a faith of sorts, that sums up a lot of what makes a resilient culture work. First of all, the formula is more or less a set of actions, and the faith is more of a set of beliefs. So, the actions in the formula: demonstrate transparency, earn trust, gain management support. Being very transparent around
SLOs, being honest about compliance with them, and over-communicating issues in a relevant fashion are key to demonstrating transparency. Trust will come forth from that. However, trust is very difficult to keep if you're not consistent with your transparency, and it can be very easy to lose. One important facet of this is that there's no perfection. You've just got to stay resilient and consistent. Infallibility is not the key. Through demonstrating transparency, you're going to gain management support. And once you have management support, you as an
operator of a team, or as the lead of a team, are given more autonomy to enact key principles from DevOps, SRE, and chaos engineering. I've mentioned this before, but it's really: build it as though you're going to operate it. You're going to be given the opportunity to bridge enterprise organizations with modern development architectures, methodologies, and practices. The only way to do that is to gain that management support and trust. Again, shared pain, and being willing to demonstrate that you will feel that pain, is key to achieving this
formula. A little more about how my data platforms team is actually exercising some of our chaos engineering practices and principles. We have a very tight pipeline integration. Our data platform has the most aggressive SLOs of any of my teams currently. Some of that comes from maturity, since it is one of my oldest teams, so we have consistently pushed ourselves to be very assertive. The target for this current year is 99.9%.
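To put a 99.9% target in perspective, the following is a minimal sketch that converts an availability SLO into a downtime budget. The function names and the 30-day window are assumptions made for illustration; the talk does not describe how Charter computes or tracks its SLOs.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows per window.

    Both the function name and the default 30-day window are illustrative
    assumptions, not something taken from the talk.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def availability_percent(downtime_minutes: float, window_days: int = 30) -> float:
    """Return the achieved availability for a given amount of downtime."""
    total_minutes = window_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why small gaps exposed by each outage matter so much.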
We are close. I mean, we are there; I say close because I feel like we're never really done. We're at a hundred percent month to date. I took that screenshot two weeks ago, and we're still at 100%. So running a hundred percent month to month is really the goal. And the only way to do that is to learn from outages and expose gaps, no matter how small. Use those outages as templates. We test for gaps every single time we make a change to the system. Stress the
whole system, and the way to really do that is to attack individual pieces of it. Challenge your architectures, even when they're not your own. We run a lot of enterprise distributed data storage and processing services provided by a cloud provider, so they're managed services, as well as many other open source services that are enterprise supported. Don't just inherently assume that those are resilient. You can do more by testing those assumptions, and we actually enjoy doing that very much. Some of the
side effects of the enjoyment of breaking things on purpose: we've seen monitoring improvement every single time we run an attack. Well, maybe 99% of the time. It tends to be one of the primary side effects, because monitoring is something that needs to be developed, iterated on, and matured as much as your software itself. You also obtain architecture and design feedback in a way that reflects back on infrastructure: you notice signs of a weak link, or when something is not quite behaving as expected. From our pie chart, black hole testing is by far
the most entertaining and has revealed the most issues for us, and we continue to run it. One example of a change that came out of it: we went from multiple clusters running in real time across several pipelines to ultimately creating separation of executors and clearing caches regularly. This was something we assumed would be done for us, which turned out not to be quite the case. We've added over a hundred
metrics in the last 2 months and 10 monitors, plus countless runbooks, or what we call MOPs (methods of procedure). Changes like these are key to smooth operations, and smooth does not necessarily mean that we are infallible. It just means we detect issues quickly and can react even quicker. So, as your team may grow, through success or otherwise, scaling culture is very, very challenging. While my perspective is on scaling culture within an enterprise, I think these commonalities hold across any business or engineering team that
experiences growth. In my case, we had 90 new people join in essentially a day, and preserving core values through that is very hard. So what we did as a management team was take time for proper onboarding and treat new people like, you know, people: respect their history. Don't assume that, because you happen to be in the inheriting group, your culture gets to be the majority, that you're doing it right, or that you're better. It's very important to exercise empathy and to continuously express the core values that are important to your culture. There will
be conflict. There will be differences in styles and standards of operating. Just be flexible and be empathetic. Again, circumstances vary wildly; just stick to your principles. Some of the best advice I've personally been given was to own your style. Be very cognizant of what has worked well, but don't be rigid. Avoid a cultural playbook. Make small changes very frequently. Establish a common set of questions, but don't stick to them like a script.
Be very attentive to warning signals; squeaky wheels can be very toxic to a culture. Introduce new ideas gradually and champion better ways to work. If you happen to be bringing on people who don't have a strong sense of DevOps, SRE, or chaos engineering, introduce them to it. It's actually an easy sell for most engineers, who understand the benefits almost immediately. They can relate, and maybe even empathize, because of their past
pain. So just be there to reinforce those concepts, and also provide new ideas, such as measurement, if it doesn't exist as strongly as it ought to. The only way to really have a culture that you can assess as working is through measurement and quantification. This is extremely difficult. We often rely on qualification for the quality of a culture, but I would say that quantification is just as important, especially as you move up the executive chain, where people reflect on the performance of particular teams in past years. They are looking at data; they're
looking at hard facts, so you need to measure. So, on the docket for data platforms: we need to continue to iterate on our infrastructure and embrace the ability to experiment, which has been mentioned several times at this conference. Chaos is experimentation for developers. It's just testing failure as an experiment, learning from it, and developing from it: building new ideas, or maybe even new pathways for function. Increasing your release cadence is key to actually enabling that experimentation. I'm also talking about things like A/B testing.
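The idea of testing failure as an experiment can be sketched in a few lines. This is a toy illustration, not the team's tooling: the `Dependency`, `Service`, `blackhole`, and `run_experiment` names are hypothetical, and the "black hole" is simulated with a raised `TimeoutError`, whereas a real black hole attack would drop network traffic.

```python
from contextlib import contextmanager


class Dependency:
    """Hypothetical downstream service; while black-holed it never answers."""

    def __init__(self):
        self.blackholed = False

    def fetch(self, key):
        if self.blackholed:
            # Simulated fault: stands in for a call that never gets a response.
            raise TimeoutError(f"no response for {key!r}")
        return f"fresh:{key}"


class Service:
    """Hypothetical service under test: falls back to cached data on timeout."""

    def __init__(self, dependency):
        self.dependency = dependency
        self.cache = {}

    def get(self, key):
        try:
            value = self.dependency.fetch(key)
        except TimeoutError:
            if key in self.cache:
                return self.cache[key]  # degraded, but still answering
            raise
        self.cache[key] = value
        return value


@contextmanager
def blackhole(dependency):
    """Inject the fault, and always roll it back when the block exits."""
    dependency.blackholed = True
    try:
        yield
    finally:
        dependency.blackholed = False


def run_experiment(service, dependency, key):
    """Check steady state, inject the fault, then verify recovery."""
    baseline = service.get(key)
    with blackhole(dependency):
        try:
            degraded = service.get(key)
        except TimeoutError:
            degraded = None
    recovered = service.get(key)
    return {
        "steady_state": baseline == f"fresh:{key}",
        "survived_fault": degraded is not None,
        "recovered": recovered == f"fresh:{key}" and not dependency.blackholed,
    }


if __name__ == "__main__":
    dep = Dependency()
    print(run_experiment(Service(dep), dep, "account-42"))
```

The hypothesis here is that the service keeps answering from cache while its dependency is unreachable. A failed hypothesis, such as a cold cache during the fault, is exactly the kind of gap such an experiment is meant to expose.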
You need a framework that supports experimentation; among the examples that we've leveraged is Apigee. You need multiple production version pathways for resiliency. And in order to really do that, and pull together all the concepts that I and many of the other speakers have mentioned, you've got to institutionalize operational tooling and procedures, and really allow the organization to trust your process for the transformation of
their core business. Observability: we have a long way to go. There are so many areas where we could expose more insight into what is happening in our services. One of my teams runs a very complex service mesh of applications, and the interactions between the services, as well as upstream and downstream, are mind-boggling. Without proper observability that actually captures tracing and logs in a comprehensive way, hopping into the production environments, particularly
by a team, is going to take a very long time. Resiliency is very difficult to achieve when your releases and your environments look different. So focusing on consistency, and driving that toward better observability, is where we need to move more and more. I don't think this is something that is ever truly done. Finally, and really part of the others: maturing our chaos engineering practice through automation. We need to automate every single test that we do as part of our development cycle. I love game days; they're fun. But really, automating chaos tests
as part of the release process is what a mature organization is able to do, and should be doing. Go beyond host attacks. We're all very fond of attacking hosts, servers, or containers, but there are many tools that allow you to intervene in and interrupt service-to-service calls or even functional processing. And I think that about summarizes everything I'd like to say today. If you have any questions, please reach out to me on LinkedIn. It's been a pleasure.