Chaos and Reliability: A Surprising Friendship in the Enterprise

Chaos Engineering is often characterized as “breaking things in production,” which lends it an air of something only feasible for technologically elite or sophisticated organizations. In practice, it’s been a key element in digital transformation from the ground up for a number of companies ranging from pre-streaming Netflix to those in highly regulated industries like healthcare, telecommunications, and financial services.


Many enterprises are grappling with application modernization at an ever-increasing scale, and leveraging chaos-informed experimentation as a facet of their SRE practices can help them get their arms around the complexity of their systems. Understanding the complexity of distributed systems is both foundational and critical to true observability. These practices lead naturally to clarity in metrics like SLOs, grounded in reality instead of guesswork.


In this talk, Troy Koss (Director of SRE at Capital One) joins Courtney Nash (researcher at Verica) to explore some of the myths of Chaos Engineering and how he's put it into practice at multiple enterprise companies to foster a culture focused on reliability. Join them to learn how un-chaotic adopting chaos engineering can be, and how effective it can be at accelerating your SRE journey. You might be surprised to find out how close you already are to getting started...


Troy Koss

Director, Site Reliability Engineering (SRE), Capital One


Courtney Nash

Senior Research Analyst, Verica

Transcript

00:00:12

Hi, I'm so excited that everyone from DevOps Enterprise Summit Virtual is joining us today. I'm Courtney Nash. I am a researcher at a company called Verica, and today I'm joined by Troy, who will introduce himself in a second, to talk about chaos engineering and reliability in the enterprise, two unexpected friends. And I'll let Troy introduce himself.

00:00:38

Yeah, I'm excited to be here. I'm currently at Capital One, where I lead SRE organizations. And we are excited to share with you the experiences that I've personally been through, and some of the myths that are part of chaos engineering, and hopefully get you comfortable with embarking on your own. Cool.

00:00:59

So this is when I do the share-screen part, and everybody gets to enjoy the awkward presentation stuff. Okay, that's us. We just talked about ourselves, so we don't have to hang out on that screen for very long. So, a couple of myths before we get started, because the very name is a bit much. Chaos engineering does not sound like something that most people want to get going with, much less in the enterprise, but really, if you think of it in terms of experimentation, then it becomes a much more approachable thing to be considering. And it's really practical experimentation that helps you get your arms around your systems. Most of us, I would assume, are here because in part we're building and maintaining and operating very complex systems with high business and production pressures, and no one person can get their arms around how that all works.

00:01:50

Within an organization that is, you know, decomposing a monolith into microservices and has upstream providers and cloud providers and all kinds of things, it's just too much complexity to be able to hold it all in your head. And so this process of experimentation tells you, ideally, sort of what the boundaries are: where's the cliff, or where are the cliffs, multiple cliffs? Are you driving at them at 90 miles per hour, or are you slowly wandering towards them? It's really hard to know, a lot of the time. And so the goal of this is to just get you comfortable with experimenting within those systems in safe ways. So, myth-wise, that name gives us a couple of things that we wanted to cover. And along the way, Troy and I discovered that we're both giant plant nerds, and like, you know, flowers and stuff like that.

00:02:37

So you get plant metaphors and maybe some goats. The first myth of chaos engineering is that it's some kind of mythical, advanced capability, Netflix, Amazon, big organizations, you've got to be at that scale to be able to do this. And it turns out it's actually the other way around, or rather, that's how they got there. Chaos engineering, the discipline, was born out of Netflix's transformation from the data center to the cloud. And I don't know if people remember this, but there was a period of time when Netflix's availability reputation was not good, and things were falling down a lot, and people were pretty mad. They couldn't watch their movies and their other things. And so folks at Netflix started hunting around, like, we've got to do something, how are we going to be able to do this?

00:03:26

And that process of figuring out how to experiment safely on their systems, and then expanding the size and the scale and the sophistication of those experiments, is why Netflix is Netflix now. And so it's often viewed in the wrong order. They really started there and got to the reliability they have now by experimenting on their systems, and other companies have started to do this as a part of digital transformation journeys, which is why we're here talking to you all today, including those even in highly regulated industries like healthcare and finance and banking. And sometimes when I say that to people, they don't believe me, which is why Troy is here, because he has done it, which is super cool. So he's going to talk a little bit about his experience with this first myth, and I'll let him take it away.

00:04:15

Yeah, well, you definitely don't have to be a master gardener. And one industry, too, that you didn't mention was telecom. I was at Verizon, in the telecommunications industry. We were growing rapidly, and still are growing as a company, and modernizing our applications, you know, the monolith-to-microservice architectures, the data-center-to-cloud journeys. And one of the ways we dealt with that complexity was grounding ourselves in a site reliability engineering program. It was something I was fortunate enough to be a part of kicking off and getting started at Verizon. And we really used it as a practice for us to try and change the way we think and ensure we have the reliability that we were known for as Verizon, America's most reliable network, right?

00:05:01

We needed the most reliable culture and the most reliable practices to embrace. And it's really that shift of moving from a reactive state, where we're on, and always on, and everyone's happy, to we're off, there's an incident, and how many incidents, and how fast do we get back up, and then it goes back to, okay, we're working, but really, how well are we working? That's the proactive shift in understanding our systems, getting ahead of that, and measuring proactively, like, how well are things going? And as we started embarking on that, if you can pull up the next slide real quick, we noticed that even that was a hard journey, like adopting SLOs alone and getting going. And we'll talk a little bit about that a little later.

00:05:44

But chaos engineering became kind of a quick and, dare I say, easy way for us to embark on this real system dependency understanding and comprehension. You probably find out that you don't know all of the edge cases and how your system works. So what better way to do that than to run verifications to see how things behave? When we were in the Kubernetes space and moving to containers and Kubernetes, the thing that we focused on was: how can we build a reliable, core cluster configuration for teams that meets our standards and needs? And are we actually doing what we think we're doing? Verifying, and running tests and verifications and hypotheses on that consistently. We found that, in some cases, with the pull time for your images, you'd expect a small image to pull faster than a large image.

00:06:39

And then when you run verifications to test that, you find that that's not the case, and you see that the pull times are sporadic and different. And then you find out that there are some network configurations happening, and you're going across different VPCs. So it's such a good learning experience, and it's pretty rich in that regard. There's also a space there for understanding whether you're secure and safe and reliable, as a part of reliability engineering, right? Looking at the images that you're deploying into your clusters and seeing if they're vulnerable, whether the vulnerability detection that you have in place is actually working: deploying known, controlled, vulnerable images into your clusters and seeing that you have the knobs turned right and the thresholds set. And oftentimes we found that we didn't, and that's okay. That's what we learned, and we got ahead of it, again in that whole shift of proactive versus reactive.
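To make that pull-time verification a bit more concrete, here is a minimal sketch in Python of the kind of experiment Troy describes, assuming a cluster reachable with kubectl; the image names, namespace, and timeout are hypothetical placeholders, not the tooling actually used at Verizon.

import json
import subprocess
import time

# Hypothesis: a small image should become Ready noticeably faster than a large one.
# Hypothetical images and namespace; swap in whatever your clusters actually run.
IMAGES = {"small": "busybox:1.36", "large": "tensorflow/tensorflow:latest"}
NAMESPACE = "chaos-verification"
READY_TIMEOUT = "120s"

def time_until_ready(name: str, image: str) -> float:
    """Start a pod with the given image and return seconds until it reports Ready."""
    start = time.monotonic()
    subprocess.run(
        ["kubectl", "run", name, f"--image={image}", "-n", NAMESPACE,
         "--restart=Never", "--command", "--", "sleep", "60"],
        check=True,
    )
    subprocess.run(
        ["kubectl", "wait", f"pod/{name}", "-n", NAMESPACE,
         "--for=condition=Ready", f"--timeout={READY_TIMEOUT}"],
        check=True,
    )
    elapsed = time.monotonic() - start
    subprocess.run(
        ["kubectl", "delete", "pod", name, "-n", NAMESPACE, "--wait=false"],
        check=True,
    )
    return elapsed

if __name__ == "__main__":
    results = {label: time_until_ready(f"pull-test-{label}", image)
               for label, image in IMAGES.items()}
    print(json.dumps(results, indent=2))
    # The interesting outcome is when this fails: sporadic pull times often point
    # at node placement, VPC routing, or registry configuration, as Troy found.
    assert results["small"] < results["large"], "hypothesis falsified; check the network path"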

00:07:33

And I want to ask you a question, which is, you mentioned you were running Kubernetes at Verizon. What would you say was, either individually, or team- or organization-wise, the sort of maturity with that particular technology at the time?

00:07:51

Yeah, I'd say it was probably at the more introductory, novice-to-intermediate level. Getting to that advanced level in containers, and Kubernetes in particular, orchestrating that, comes with a lot of time and experience and really working on the systems. And it's a skill set that's highly sought after, and people are evolving and growing into it. But we were definitely early on, and I think it's about understanding how things work, like what happens when a node goes down, does my application scale the way it's supposed to when we move to containers, did we get all the stateful things out, things like that, that you discover in moving. But yeah, to answer your question directly, we were pretty early on in the journey. And I think a lot of places are; everyone I've seen is pretty early on.

00:08:41

It turns out I've seen the same thing. I ran a small survey a few months back of about 50 organizations that we'd had some contact with, or had reached out to, that were running either Kubernetes or Kafka, trying to understand, again, the maturity with which people are dealing with this. And I was really surprised to find, first of all, that one of the biggest chunks of organizations we talked to were really big, you know, 10,000-person, basically enterprise types of organizations. We had a pretty good range of roles, but you see the folks that you'd expect to see in there. But the thing that really surprised me was how early on people were in their experience with these kinds of really complex technologies that they're using in full-scale production systems at 10,000-person companies.

00:09:35

And I was like, wow. So, some people are probably like, oh, terrifying, but I mean, that's just the state of the industry right now. We are trying to grapple with the complexity, and we're using tools that both help us do that and add to it at the same time. And so, hopefully I don't completely belabor this point to where people are sick and tired of it, but you just don't have to be a master gardener. Most people who are looking at doing this are really starting at it pretty early on, which I would argue is the better way to go. And speaking of going, or goats, that's what we get next. That's myth number two, which is that chaos engineering, as I just said, right, we have pretty complicated, sometimes chaotic systems, so why add more?

00:10:23

That seems like a terrible idea, don't do that. And so this point is really about how chaos engineering isn't about adding chaos; it's just seeing the chaos that's already in your systems. It's letting the goats run wild, but like in a pen where you can see them, they're in a pen. Okay, I'm done with this metaphor, I'm sorry, you all are probably really sick of it. So I will turn it over to Troy to talk about his experience with chaos in systems he worked on.

00:10:55

Yeah, yeah, definitely. And you know, that notion of, we're going to introduce more chaos, why do we need more chaos, we need less chaos. Really, you look at the definition and what we're trying to do: you're preventing the chaos, you're trying to get ahead of chaos, you're trying to get ahead of the unknown. And one of the methods and means, as a part of the SRE journey that many teams are on, or digital transformations, is the adoption of SLOs, not to be confused with SLAs, and having those as a consistent way to measure our systems and our services, to be able to know the bounds of what we can experiment with and what we can't, and whether we are meeting customer expectations. And, you know, we actually didn't even have formal SLOs at the time, when I was at Verizon and adopting chaos engineering.

00:11:44

And in fact, we used chaos engineering to help us look at SLOs and understand them a little more closely, running verifications to find out where things should be set in terms of thresholds. Should it be 200 milliseconds, 250? That's what everyone unfortunately tends to gravitate towards, nice even numbers. But maybe it's 186 or 187. That's one thing we discovered in doing that: what are the appropriate ones, finding SLOs that are set, as we were chatting about, grounded in reality, right? What does the SLO need to be, rather than just guesses, a SWAG to say the least. At Capital One here, we're developing a lot of tooling to put SLOs in place, agnostic tooling that can handle the ever-changing tool dilemma of whatever tool we're using today, the flavor of the week for APM. Building tooling so we have SLOs in place means we can start embarking on a lot of these things, like running chaos verifications and experiments, to understand our systems and learn from them. And having consistent measures in place ensures that the myth you just spoke about, about introducing more chaos, doesn't happen; in fact, we're within our bounds, we're within a safety net, a responsible place to be. It's pretty good to start adopting SLOs as a measure to help with that.
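As a rough illustration of what "grounding the SLO in reality rather than in a round number" can look like, here is a small Python sketch that picks a latency threshold from observed samples; the numbers are entirely synthetic, not Capital One's or Verizon's data.

import random
import statistics

# Pretend these are request latencies in milliseconds captured while a chaos
# verification injected a modest amount of latency downstream. Entirely synthetic.
random.seed(7)
latencies_ms = [random.gauss(mu=165, sigma=18) for _ in range(5000)]

# Instead of reaching for a nice round 200 ms, let the observed distribution
# suggest the threshold, for example the 99th percentile of today's behavior.
p99 = statistics.quantiles(latencies_ms, n=100)[98]
print(f"observed p99 under induced latency: {p99:.0f} ms")

# An SLO statement grounded in that measurement (maybe it lands at 186, not 200):
slo_threshold_ms = round(p99)
slo_target = 0.999  # 99.9% of requests should complete under the threshold
within = sum(1 for x in latencies_ms if x <= slo_threshold_ms) / len(latencies_ms)
print(f"proposed SLO: {slo_target:.1%} of requests <= {slo_threshold_ms} ms "
      f"(currently meeting {within:.2%})")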

00:13:10

Yeah. And I mean, I think the phrase you used when we first talked about this was that it allows you to experiment safely, which I really like. And so to that end, to begin experimenting, we'll stop with the myths and start with: what do you really need in place? Because I get asked this a lot, and a lot of the things you've explained to me in your journey, Troy, I feel really hit these particular points. So the first one is instrumentation, to be able to detect some sort of degradation, or lack thereof, in your system. And I think a lot of times this hints at the first myth, which is that people think they need to have really sophisticated observability systems or whatnot. Use what you've got. And even to that point, Troy just said that at Verizon they didn't have SLOs yet.

00:13:59

And so, you know, you don't even have to be at that level, and maybe a lot of organizations aren't necessarily there yet at having those set. So just use what you've got, whatever kind of logging, tracing, what have you, use that. And you'll refine it as you go; as Troy said, you'll find out what's working there and what's not working there as well. So for the next few prerequisites, we're going to do a little more of this back and forth. And on the second one, we get to have a little bit of a chat, and more goats, because really, who doesn't love goats? So beyond instrumentation, you need social awareness, which this one particular goat definitely lacks. It's really important to be explicit with everyone who might be involved in terms of what you're doing, to what end, the expectations, and the outcomes.

00:14:47

Chaos engineering sounds scary. If you're already bought in on this, great, but let's say you might not be; Troy wasn't. Then you're going to run into some resistance, like with any big change, but this one sounds particularly nerve-wracking. And so not telling people is sometimes tempting, right? Like, I'll just go run some experiments, and then it'll be great, and then I'll show people the results. Except it might not be great, because you don't know; that's the point, right? So you really do have to build the beginnings of people willing to go on this journey with you. And it's really easy to talk about that in the abstract and for me to be like, yay, you know, do this, but people are often like, no way, I don't get it, how? And so I really want to hear from you, Troy: I know you had your own personal trepidation about this, but then organizationally, how did you all actually take that first step, and what did that look like for you?

00:15:45

Yeah, definitely. And I echo the sentiment that you're giving off, which is, you don't want to be that goat that you're showing on the screen there, that bad one that's just nuking things and leaving craters. But there are a few things to keep in mind. One is, you don't have to start in production. Understand that that's an evolution you get to; you don't start there and just start doing things there, running your verifications and your hypotheses there. You want to keep your scope small, you want to keep it in a limited fashion. As I mentioned, we focused on the Kubernetes platform itself, the underlying infrastructure, and the orchestration of the clusters. That was a small-scope place where we had dependent parties involved, but it was smaller.

00:16:29

And we were able to articulate the blast radius and contain the blast radius as well. Run simulations, testing some of your hypotheses out with fake data and other systems, just to prove it out and understand it. And one thing that I do want to hit on is, while I say keep things small and contained, you definitely want to hit things that are effective, and Courtney will talk about that in a second here. Make sure that the work you are doing is something that's meaningful and that you're working on systems that matter. To get that buy-in and to get that value: the low-hanging fruit are fun and easy to get, and those are nice, but make sure that you address some of the things that matter most to the enterprise, things that have that value tied to them.
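One way to make "small scope, contained blast radius, people informed" concrete is to write the experiment down before running anything. A minimal Python sketch follows; the fields, names, and thresholds are illustrative, not a real chaos tool's schema or anything Troy's teams used.

# A sketch of how a first, deliberately small experiment might be written down
# before anyone runs anything. Field names and values are illustrative only.
first_experiment = {
    "title": "Killing one pod does not breach the checkout latency SLO",
    "environment": "staging",             # not production on day one
    "scope": {
        "namespace": "checkout-staging",  # hypothetical namespace
        "max_pods_affected": 1,           # blast radius: one pod at a time
    },
    "steady_state": "p99 latency < 250 ms and error rate < 0.1% for 10 minutes",
    "method": "delete one pod and observe rescheduling and latency",
    "abort_conditions": [
        "error rate > 1%",
        "any paging alert fires",
    ],
    "people_informed": ["checkout team", "platform on-call"],  # the buy-in part
}

print(first_experiment["title"])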

00:17:19

Yeah, that's the perfect segue, thank you, sir. So I wanted to take a minute to talk about hypotheses, because it's easy to throw these words around, you know, verifications or hypotheses or all these words, but there's some science to this, and the notion of experimentation is grounded in that longstanding tradition. And, in my opinion, the really key part is what that hypothesis is, right? So you have a control state, you have some perturbation you're going to introduce, and you have a hypothesis about what's going to happen. If your hypothesis is that broken things are going to break, then that doesn't really help you. The point of this is to understand your systems better; if you already understand that about your system, then you just spent a bunch of people's time confirming something you already knew.

00:18:09

I understand sometimes you want to do that, so you can get buy-in or budget or whatever to fix it. I can totally relate to that. You'll get there. But I'd say you'll get there by showing people things they didn't know about how their systems work. And so those hypotheses should be things that uphold expectations, because then, when those turn out not to be right, the light bulb goes off for people, right? So if you do have SLOs, you might have a hypothesis statement along the lines of: this service will meet X, Y, Z SLO even under conditions of high latency, like in the data layer, whatever. And that should be contextual to your business, to your customers; that should make sense, right? And then if it does, great, and if it doesn't, then you're really learning. That's why, if you have those SLOs, it's a great space to play, because SLOs are directly about those kinds of business-critical outcomes.
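For readers who like to see that spelled out, here is a small Python sketch of turning such a statement into something falsifiable; the service name, condition, and numbers are hypothetical examples, not anyone's actual SLO.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    service: str
    condition: str        # the perturbation you introduce
    threshold_ms: float   # the latency bound the SLO promises
    target_ratio: float   # fraction of requests that must stay under the bound

    def holds(self, latencies_ms: list) -> bool:
        """True if the observed latencies still satisfy the SLO expectation."""
        ok = sum(1 for x in latencies_ms if x <= self.threshold_ms)
        return ok / len(latencies_ms) >= self.target_ratio

# Hypothetical example: "payment-api will meet its 250 ms / 99.9% SLO even with
# 100 ms of extra latency injected toward the data layer."
h = Hypothesis(service="payment-api",
               condition="+100 ms injected latency on database calls",
               threshold_ms=250.0,
               target_ratio=0.999)

# Feed in latencies observed while the condition was applied; a False result is
# the interesting one, because it is something you did not know about the system.
print(h.holds([180.0, 210.5, 190.2, 240.0, 260.3, 199.9]))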

00:19:06

Right. And so I think that's a really nice alignment: if you're playing around in SLO space, then you're really doing something, like Troy said, that's actually meaningful to the business. So have really well-formed hypotheses that are meaningful and contextually relevant. That also means, like I said, I want to go back to the bit about not breaking already-broken things, and instead finding things you didn't know about your system. Troy has a good story about that, so I'm going to pass it back over to him now.

00:19:40

Yeah, definitely. And just to reiterate one more time, because the third time's a charm: you don't need the SLOs to get started, and in fact, like I said, chaos engineering can help you get there. But they're definitely a good enabler, to Courtney's point. And another point that you made earlier, Courtney, was that you really don't need a lot. You have metrics, you have logs, you have alerts in place; teams are trying to adopt some sense of observability for their systems, respectively. But sometimes when you run the hypothesis, like I mentioned earlier about a vulnerable image, you're like, well, no, we know we stop all vulnerable images, and then you put a known vulnerable image out there and it actually deploys.

00:20:14

And you're like, actually, we didn't. So you find these things, and there's true value there, especially in the security domain too. But one thing that can also come as a byproduct of it is, you run your hypothesis, and you think that you have the necessary alerting and safeguards in place, the instrumentation that you've always had, like, you know, we have our alert policy and it will go off when bad things happen to our system. And when you start running chaos experiments and verifications, you soon learn that sometimes your alerts weren't set correctly and they don't go off like you think they do when things happen, as you're running these verifications. And it's a good thing, it's a good thing to find out that those things are out of place in the controlled environment where you're running these experiments, rather than when you actually have a production outage and you don't know you have a production outage, because the alerts don't go off, and then your MTTD and your MTTR become chaotic.

00:21:03

And then everyone's scrambling to get a resolution. Finding these things out in controlled environments is a super great place to be. You get two takeaways: A, you learn about how your system actually responds during that verification, whether it's, like you mentioned, injecting latency into your requests, or taking down nodes and seeing how things respond and how long it takes for applications to redeploy, et cetera. You find that out, and then you also find out that your alerts weren't good, so your observability, as a byproduct, becomes enhanced and enriched. So it all ties into that whole culture of reevaluating and constantly being able to assess your system. And, as I mentioned at the beginning, it's that shift from reactive to proactive, being able to get ahead of when that event happens that we are so fearful of. But yeah, I echo your remarks.
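A rough sketch, in Python, of the "did the alert actually fire" half of that verification, assuming a Prometheus server with an alert rule already defined; the URL, alert name, and timings are placeholders, not a description of any particular team's setup.

import json
import time
import urllib.request

# While the fault is applied, poll Prometheus and verify that the alert you
# believe covers this failure actually reaches the firing state.
# URL, alert name, and timings are placeholders for whatever your environment uses.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
EXPECTED_ALERT = "CheckoutHighErrorRate"
DEADLINE_SECONDS = 300

def alert_is_firing(name: str) -> bool:
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10) as resp:
        alerts = json.load(resp)["data"]["alerts"]
    return any(a["labels"].get("alertname") == name and a["state"] == "firing"
               for a in alerts)

deadline = time.monotonic() + DEADLINE_SECONDS
while time.monotonic() < deadline:
    if alert_is_firing(EXPECTED_ALERT):
        print("alert fired as expected; the safety net is verified")
        break
    time.sleep(15)
else:
    # The "good thing to find out in a controlled environment" case: the alert
    # never fired, which would have inflated MTTD and MTTR in a real outage.
    print(f"{EXPECTED_ALERT} did not fire within {DEADLINE_SECONDS}s; fix the alert rule")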

00:22:00

So this is that reactive-to-proactive dynamic that you're talking about, that shift. The last prerequisite for chaos engineering is being able to, well, ideally you're proactive, but at some point you've got to react to the experiments and the results of those. And this is the flip side of the coin of buy-in: the more work you put in upfront on the buy-in front, the more likely you are to have alignment to actually respond. This may sound obvious, but this can be where chaos engineering efforts die on the vine, because teams are busy, we have a lot of work to do, and the thing you did might actually have impacted some other team or some downstream thing, and now you've got to get those folks on board. So I feel like it's an almost obvious but critical prerequisite, and it's a big cultural change.

00:22:55

Which most of y'all here should be pretty familiar with trying to make happen. I think the thing that's really great about this one is, like Troy said, if you can start small and limit the scope and the blast radius and everything, you get a good virtuous cycle going, right, where people see the benefit of the experimentation, they put the changes in, and ultimately you move on to bigger and thornier things; that's really how that works. At that point, hopefully you've basically put that cultural infrastructure in place where everyone is actually excited about this stuff instead of terrified by it, and sees the benefit of that experimentation. So I like to refer to this as your cultural infrastructure. We like to talk about our cloud and other kinds of infrastructure a lot, but this one is also incredibly important. So nurture it, and don't forget that other people are going to have to get involved in the implementation side of it, and be prepared to help them do that. And those are our prerequisites and our myths. So I will hand it over to Troy to close things out, with some final thoughts on chaos engineering in the enterprise.

00:24:07

Yeah, definitely. And the whole cultural piece that you just hit on, again, that is what SRE is. So yes, there are the practices, and there's toil and terms and buzzwords and all the good things that come with it, but it's a culture, it's an approach. It's, how do we address these problems in a consistent way? And all of the things we just discussed about chaos engineering, the kinds of experiments you can run and all those different sorts of things, it's really just part of it. It's really part of SRE in my mind, at least how I've defined it. In that, you have to understand your systems, your vulnerabilities, have resilient architecture and all these patterns that we're following as our critical intents as a part of our SRE program here at Capital One, and chaos engineering is a part of that; it's one of those intents.

00:24:50

There are teams that are going to have different levels of maturity as you embark on your SRE journey, and you can think of it like a menu. I always joke about it that way, probably because I like food, but there's a large menu of items that you can dive into on your SRE journey, and you pick what actually works for the teams, and chaos engineering should be one of those, because there are different levels of maturity and there isn't one particular way to do it. But if you have set out the items on the menu, start early, and don't wait to understand your systems. You ultimately want to be providing a better experience for your customers, a more reliable experience, which is most important.

00:25:28

Well, thank you so much for joining me today, Troy, and thank you, everyone, for joining us and for putting up with goat jokes and plants. Not plant jokes, just plants; plants are great, goats are jerks. And that's it for us. I hope you enjoy the rest of the conference. Thanks.