Las Vegas 2019

The Last Mile Continued - Incident Management

In the follow-up to Damon's 2018 "Operations: The Last Mile" keynote, this talk will examine incident management in the era of DevOps and SRE. Responding to incidents has always been the core job of Operations.


Today, the influences of DevOps and SRE are changing how Operations work gets done, and even who is doing the work.


This talk will look at how high performing organizations are applying DevOps and SRE practices to shorten incidents and reduce escalations. Less frustration for the engineers. Lower costs for the business. Everybody wins.

Damon Edwards

Co-Founder, Rundeck

Transcript

00:00:02

My name is Damon Edwards, from a company called Rundeck. Just so you know, the slides are already online. There's a lot of content in the slides, so if you want to take a photo of this slide, you can. That's also my Twitter; if you want to tweet at me, I'll post the slides there as well, but you can get all the slides from that link. And obviously they'll be in the conference SlideShare as well. How many folks saw my talk last year? Some. Good. All right, thank you for coming, and you're still here. Appreciate it. So my thesis last year was that operations is really the last mile. It's the thing we need to unlock the full value of these DevOps transformations. All this work we're doing, we're not going to get the value out of it unless we unlock and transform how we do

00:00:45

Operations. I talked a lot about the issues we have with silos, with work queues, with excess toil, with low trust. That talk is online. This year is really digging in a little deeper, and I'm going to talk more about incident management. Why? Because the ability to respond to and resolve incidents is the true indicator of an organization's operational capability. This is where the rubber meets the road: how we respond to and handle our incidents. So a little bit of a definition first, because there are a lot of definitions of what an incident is in the world, maybe even in your own organization. I look at it this way: an incident is an unplanned disruption impacting customers, which is pretty obvious, or business operations. The customer side is pretty self-explanatory: outages, service degradation, what we classify as incidents in the classic ITSM world. But I also look at it as unplanned disruption in our business operations, right?

00:01:49

So work interruptions, delay, waiting, short-notice requests. I use that as a euphemism for nobody letting you know until right when it's necessary. But all those things impact people's work, and that eventually bubbles up into delay at the delivery and business level. We are avoiding fixing technical debt, avoiding fixing problems that could prevent future problems. So to me, these disruptions in business operations end up being customer impacting in some way or another; it's just a little more diffuse. So why would we separate these things out and say, let's talk about just the outages and service degradation as one thing, and the rest of it is something completely different? It's still the same people doing the work. We're still causing the interruption and the blast radius in the organization. So that's my definition. And a quick show of hands.

00:02:38

How many folks are actually working in operations? Keep your hand up. How many of you go more than an hour without somebody interrupting you, asking you to do something, or something you didn't plan for? How many can go more than four hours? Take your hand down once it's past this. How many can go more than a week without somebody interrupting you? Anybody? Nobody. See, this is the world we live in, right? So those interruptions are just as important as anything. So the format of this talk: I'm going to talk about the life cycle of incident management and how I see a lot of high-performing organizations transforming and attacking this problem. There are a lot of people I'm going to mention or reference.

00:03:19

They may not agree with everything I'm saying; I'm not implying they're endorsing this. And there might be people I left off. Please don't tweet at me. Okay. But before we talk about the cycle of an incident, I want to talk about the context that we're living in, the stew that we're all marinating in, because I think this is very important for understanding why we've arrived where we're at. The first one: digital transformation. Nobody groan, right? Please keep it to yourself. But what's going on with digital transformation? There are a lot of definitions out there, probably more than for DevOps. If you think about what's going on, it's this impulse from the board level, and I've seen enough of these communications from the board level down to the technology organization.

00:04:01

What are they really after? Number one, everything's got to be integrated. No more of this customer service agent doing something on this machine, then this machine, then this window, then this window. They want to see it all integrated. They want to have one common system for things. They want things to be responsive. They don't mean web responsive; they mean they want to be responsive to the industry, responsive to the competition, responsive to customer requests. They want to feel like the IT organization has that responsiveness, not that a request just goes into a queue and sits there forever. They want it everywhere. They want it on all the devices. They want it to happen on your mobile, on your laptop, in your Siri, right?

00:04:39

And they want things always there. It's got to be on. The idea of maintenance windows in 2019 is an afterthought. And so all of this is flowing down to the technology organization. I think Cornelia Davis, one of the speakers here, does a great job in her book, Cloud Native Patterns, of breaking down what that looks like: what does the digital transformation flow-down to the technology organization look like? So digital transformation has driven us over to this idea of cloud native technologies, and there's been this explosion of new architectures and new platforms. John Willis and Kelsey Hightower are two people I follow who do a great job of deciphering what's real and what's not real. Obviously Kelsey's a little more Kubernetes focused than most, but he still does a great job of bringing out the reality.

00:05:24

So we've seen this explosion in new platforms, and what that has enabled is, if you've seen these yet, these Death Star diagrams. Cornell University did a great job with this visualization of microservices, which is actually for a popular online service. The complicated world of these microservices is exploding. And I think one of the best explainers of what this means to companies is Adrian Cockcroft. He did a great talk at DockerCon back in 2014, talking about how architecture enables speed. What we're really after here is being able to decouple the organization. If you decouple the organization, teams can move faster. Speed is the competitive advantage that is being driven down from that board level.

00:06:13

So at the technology level, we're pushing for that decoupling, we're pushing for that speed and that ephemeral nature of our infrastructure, which has led us all the way over to DevOps. How are we going to manage these people? How are we going to take advantage of all this new infrastructure, these new capabilities? We're here at Gene's party, so I might as well reference Gene here. He brought us the Three Ways, taking a lot of people's work and bringing it together to say, look, it's about flow, fast feedback, feedback loops. Now with The Unicorn Project, we've got these Five Ideals. And the reality is that a lot of that has been interpreted as: it's all about dev and it's all about the go, go, go, right?

00:06:53

What about operations? That's the thing that's still burning in the background. And that was the whole point of my talk last year, so if you want to go back and watch that, it's on YouTube. So what is the pushback if it's all go, go, go? I think that's really come in the form of the SRE movement. It starts to provide that feedback from the system, to say: what is operations' response to this go, go, go nature of the DevOps world? Ben Treynor at Google was the first person to coin the term SRE and put together the first SRE organization, but it's really following patterns that a lot of cloud native organizations are following. And they really bring about these principles, right?

00:07:39

Like: SREs need service level objectives with consequences. SREs have time to make tomorrow better than today. SRE teams have the ability to regulate their workload. It's all about that feedback: when everything is go, go, go toward ops, how do you provide the feedback to stay in control? Folks like Tom Limoncelli, Stephen Thorne, Liz Fong-Jones, and Niall Murphy are all people I follow in this area who are doing a great job of surfacing what's special about this new world. And there's a third O'Reilly book called Seeking SRE, which I'm going to plug because I actually wrote a chapter in it. So I think it's the best SRE book yet. But let's really start to put this together. What's actually happening here? What's going on is things like product, not project; if you were here for Mik's talk, you got a lot of that, right?

00:08:27

Continuous delivery, shifting left, error budgets, toil limits, cloud native technology, and so on. It's really building this self-regulating system. We're breaking down our world, decoupling into these horizontal streams like Adrian was talking about, and we're really building self-regulating horizontal systems. Jon Hall from BMC is actually the first person to point that characteristic out to me. And whether you're in pure cross-functional teams or you've still got a classic dev and ops organization, it's about building value-aligned, self-regulating systems and building shared responsibility models between people in, air quotes here, dev roles or ops roles, to balance out that work. Now, let's compare this to what a lot of us grew up in, which is more of a traditional ITSM focus to the world, right?
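To make the error budgets and "service level objectives with consequences" mentioned above concrete, here is a minimal sketch of the arithmetic. The function name, the request counts, and the "freeze releases" consequence are all illustrative assumptions, not something the talk prescribes.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare observed failures against the failures an SLO allows.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    # The budget is the share of requests the SLO permits to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # Hypothetical consequence: pause feature releases once the budget is spent.
        "freeze_releases": remaining < 0,
    }


# Example: a 99.9% SLO over one million requests allows about 1,000 failures.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(budget["freeze_releases"])
```

The point of the sketch is the feedback loop: the same number that measures reliability also regulates how fast the "go, go, go" side is allowed to move.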

00:09:18

In that world, everything was about the process. The process has a process owner. The process has inputs, outputs, triggers, metrics. What do we do? We assign an up-and-coming manager, saying: this is your process. Here are your metrics, here are your triggers, here are the outcomes we want. Go manage that thing. And with sharp elbows they are going to manage that thing to the best of their ability. They will be the best firewall rule changers west of the Mississippi. Then you get something like ITIL, starting in '89, which formally defined 26 of these processes. Now they call them practices. Oh, and on top of that, I forgot, there's this notion of change authority: that there's some external body granting the authority for you to make a change.

00:10:03

Somebody else is going to tell you whether or not your change is going to be correct, or they'll say, oh, just bring it to us and we'll give you advice on it. But in general, the idea is that authority is flowing from some place. And what happens with this kind of horizontal view of the world is that we're unintentionally encouraging silos. People are people; they want to achieve their KPIs and their OKRs. And we end up with these unintentional silos that break the flow of work. On top of that, we're encouraging, whether it's unintentional or not is debatable, this command-and-control management: the idea that an external source is going to catch our problems and coordinate our change. And if you think about the complexity, if you think about that Death Star diagram of microservices, how can anybody be that external authority?

00:10:51

And if we've all studied our Deming, one of his main points is to cease dependence on inspection to achieve quality. External inspection has yet to achieve high quality, as opposed to building quality controls into the system. So what I see going on here is this new way, these horizontal self-regulating systems based on DevOps plus SRE thinking and practices, actually starting to replace and rebuild what we did in the traditional ITSM world. And fundamentally, I think they're actually quite incompatible. But that's a longer discussion, maybe for a different talk. Rob England and Charles Betz, two faces probably familiar around here, may not agree with me on all of this, but they're also great people who do a great job of documenting how this world is fundamentally changing.

00:11:43

So moving along: the digital transformation is driving our new architectures, and we've got a new way to run our people. And one of the things we've realized is, what have we got going on here? We've got this extremely complicated microservices architecture combined with the go, go, go speed. We're really living in a complex world. And I think Paul Reed has done a great job of breaking this down, to say that from the development side, people often think it's very deterministic. We know how nginx works, and if there's a bug, we'll see it right there. But once you get into that Death Star diagram of microservices and the unpredictable user traffic going on there, we've really moved into a complex system.

00:12:27

If you think about complex systems, we can't perfectly predict what the behavior is going to be. We can't just break it down and say, I understand how nginx works, I understand how MySQL works, therefore I understand how this complicated system is going to work. And so we have to start thinking about it in terms of actually living in this complex world. For people who work in operations, I'm sure all of you have felt this for a long time. And there's a seminal paper by Richard Cook. He's not a technologist. He has spoken here before. He's an anesthesiologist, but a famous researcher nonetheless, and he wrote this great paper, in the early nineties actually, about how complex systems fail. I highly recommend everybody look it up and read it. I think it will definitely change your mind, if you aren't already on board with the idea that you cannot stop failure from happening in a complex world; you can only cope with it.

00:13:17

And a great thought from Charity Majors, who has a very pithy way of saying things: distributed systems have an infinite list of almost impossible failure scenarios. Hindsight bias: we always say, oh, we should've seen that coming. But the reality is, it was almost impossible, and it's never going to happen again. And as you talk to people in organizations who spend a lot of time on resilience engineering, trying to improve their operations, it only gets weirder and weirder as you go. So that brings us to this idea of safety science and resilience engineering and how that has started to influence our world. These folks, Dr. Woods, Dr. Cook, Sidney Dekker, are famous in their worlds, and they study things like aircraft disasters, nuclear power plant incidents, healthcare disasters. There are decades and billions of dollars of research, time, and effort that have gone into these domains around the world.

00:14:16

And they're now bringing them into our world. Why is that? John Allspaw, who's one of the people really responsible for bringing this line of thinking into our world, talks a lot about this. He says that the reality is there's above the line and below the line. Above the line is all the things that we think we're doing. We see the people, we see abstractions, because these are abstractions. But the real system is underneath those abstractions, and we actually can't see it. We can never really get to it. All that we have is an idea in our head of what that actually is. And my idea of what it is and your idea of what it is, even though we think we're talking about the same thing, are probably very different.

00:14:56

So really the only hope we have of managing these systems is to worry about the interaction between the people who have to work on the systems, so we can learn together, so we can stay on the same page. So really it's about the people, just like it's about the people flying airplanes, about the people operating in the operating room. The human management side is the difficult part. And this above-the-line, below-the-line metaphor, I think, is great for understanding why. So that's coming into our industry now. More folks to follow: there's Paul, who runs this conference called REdeploy, more like a gathering. It is now like the epicenter of the resilience engineering folks who have taken these ideas from the broader world, from high-consequence domains, and are bringing them to managing the complex world that we live in.

00:15:45

And they hate slogans, so I'm going to put the stuff they're talking about on bumper stickers, just to make them mad. They talk about things like: there is no root cause. That's just a political distinction. Human beings like to draw things in a straight line. We like to cast blame somewhere. Going back to the earliest times, there was always the idea of an act of God. We want to cast blame someplace, when the reality is, where we stop is not actually a root cause. There are still many other contributing factors going into that. So it's a fascinating world to get into. In the same way, they'll pick apart why the Five Whys doesn't actually achieve what we want it to achieve.

00:16:24

There's a new idea of Safety-I and Safety-II. In the old world, Safety-I, we studied the problems; that's how we stop future problems. We keep worrying about what went wrong and figuring that out. In the Safety-II model, they flip it around and say, well, what went right? Because they've realized that the same things people do day in, day out that make the business run, that same effort, those same activities, in a slightly different context or a slightly different combination, cause disaster. So if you don't understand why your systems actually work, and it's probably a miracle that they often do in the first place, then you're not going to understand why they don't work. So it's a very interesting view of the world. And I love this idea, which is: incidents equal unplanned investments, right?

00:17:06

The question is just, what's the ROI that you're going to get out of it? So you put all this together, and really where we're going with this is it's about elevating the human. How do we get more of the Iron Man model, where we know we have to put the human in, and how do we support the human and give them the tools that they need to get their job done? Not this idea that we're going to build robots that are somehow going to run these complex systems. And trust me, those other domains have spent billions of dollars and decades trying to get the human out of the operating room, the human out of the airplane, the human out of the power plant.

00:17:42

And they haven't figured out a way to do it. We're probably not going to solve it on our end. And you also see the other movement, which is this idea that ops work doesn't have to be miserable. On-call doesn't have to be miserable. How can we focus on elevating the humans? They are our best assets; how do we stop burning them out? And I think a very good conversation happened on this stage with Christina Maslach, who John Willis and Gene Kim brought into this world, another world-famous researcher. She goes on 60 Minutes and talks about human burnout. She has really identified what burnout comes from, what the causes are, and it's not just overwork. There are a lot of other contributing factors. She has really brought a lot to this domain.

00:18:23

Folks like Jayne Groll are trying to identify and highlight who the humans behind this are. How do we elevate them? And why we want to do this: this came from someone's recent S-1 that I read, there are 18 million IT operations professionals on the planet, which includes networking and everything, and there are 22 million developers. How can we make all their lives a little bit better? So that was the stew, the context that we're living in. Now let's actually talk about the cycle of an incident. I broke it down into these three areas so we can see what people are doing: observe, react, I'm sorry.

00:19:05

Next slide. Observe, react, and learn. If you notice, this feels a lot like an OODA loop. If you don't know what that is, the OODA loop is something that was originally devised by an Air Force colonel, John Boyd, very famous in the warfighter community. The idea is that all tactical activity involves observing something, orienting yourself to what you're seeing, deciding what you're going to do, and then acting, and how fast you can go through that is how effective you can be in operating. Their scenario came out first with airplane dogfighting: if you can go through your OODA loop faster than someone else goes through theirs, you're going to win the dogfight. It's a really fascinating field to look into.

00:19:50

This is actually the first drawing that he ever did of the OODA loop. It's so famous in the military world that it's featured in the Marine Corps museum as a founding document. So I'll put the little orient and decide on this map for the OODA loop purists. But let's talk about what is going on on the observe side. Monitoring, this one we've known for a long time: spotting the knowns. How do we set the traps to look for the problems that have happened in the past? But talking about these complex systems, talking about the idea that there's this infinite number of failure scenarios, how can we look for those patterns? We can't, right?

00:20:30

And that's where this field of observability is coming in, which is really about interrogating the unknowns. How do we look at the unknown events? How do we look at the activity of our systems and break it down, and use the human to figure out what is actually happening? It really breaks down into these three parts. Number one is logging. That's the event; we need a record of the event. Then there are metrics, which are data points over time. We've lost all context of the event, but we can know, is this number higher or lower than it was before? It's an important leg of the three-legged observability stool. And the third one is tracing: those events in the context of a single request.
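As a rough illustration of those three legs, here is a minimal, standard-library-only sketch. The in-memory lists, the `handle_request` function, and the field names are assumptions made for the example; a real system would use a logging, metrics, and tracing stack rather than Python lists.

```python
import time
import uuid
from collections import defaultdict
from typing import Optional

logs = []                    # logging: a record of each event
metrics = defaultdict(list)  # metrics: (timestamp, value) data points, no event context
spans = []                   # tracing: events tied to a single request


def handle_request(path: str, trace_id: Optional[str] = None) -> str:
    """Handle one hypothetical request and emit all three signals."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.monotonic()
    logs.append({"ts": time.time(), "trace_id": trace_id, "msg": f"handling {path}"})
    # ... the real work of the request would happen here ...
    duration = time.monotonic() - start
    # The metric keeps only a number over time; the request context is gone.
    metrics["request_duration_seconds"].append((time.time(), duration))
    # The span keeps the request context so events can be stitched back together.
    spans.append({"trace_id": trace_id, "name": "handle_request", "duration": duration})
    return trace_id


def events_for(trace_id: str) -> list:
    """Everything recorded in the context of one request."""
    return [event for event in logs + spans if event["trace_id"] == trace_id]


trace_id = handle_request("/checkout")
print(events_for(trace_id))
```

Note that the metric cannot answer "which request was slow?", only "is this number higher or lower than before?", which is why all three signals are needed together.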

00:21:12

So how do we look at all the events that happen in the context of a single request, from a human perspective? Charity Majors and Adrian Cole are two folks that I highly recommend following in this particular area. And then there's a new kid on the block in the observe world: the idea of automated governance. In the enterprise, we can't forget about the controls we need to put in place and making sure they're being followed. It's an emerging thing. John Willis is working a lot on this. At the DevOps Enterprise Summit, IT Revolution runs a forum; a lot of people got together and wrote a paper on this. I think they have it available at one of the booths somewhere.

00:21:56

But the idea is, all of this monitoring, observability, governance: how do we put that in the hands of everybody? No longer is it in the hands of a few, with isolated views of these things. How do we diffuse these three facets of observation and visibility and spread them to everybody so they can actually take action? So moving along on our cycle: we've gone from seeing what's going on, and now we've got to start orienting ourselves to it and make a decision about what we're going to do. Where is that going? The first step here is incident command: mobilizing and coordinating communication between the people. And a lot of this starts with something that came from the government side.

00:22:40

Again: the Incident Command System. Now it's under the auspices of FEMA, but it's a lot of study of how you mobilize, communicate with, and coordinate human beings to go resolve an issue. One of the first people in our community to really bring this in was Jesse Robbins. They used to call him the Master of Disaster. He ran the early game days at Amazon and was one of the big proponents of the idea that we have to break things to learn from things. Then guys like Brent Chapman, now over at Slack, doing the same thing, and Ernest Mueller, another good influence in this area, as well as the folks at PagerDuty. They've actually started to document and open source, in this GitHub project, their incident response plan, all based on the Incident Command System. I just put Matt Stratton in there.

00:23:26

He does a great job of explaining it to the world. I'm sure there are other people who are also doing a lot of work there. So the next thing is, well, we're going to be mobilizing our people. Who are we mobilizing? There's this split that we see going on, a kind of divide and conquer of operations. First, it started to get very blurry. This is the t-shirt from the very first DevOpsDays Mountain View, the first one in the US. It was one of Andrew Shafer's ideas: ops who are devs, who like devs to be ops, who do ops like they're devs, who do dev like they're ops.

00:24:01

And if you were a teenager in the nineties, you know what the refrain is after that. So we saw this blurring, and now we see this division starting to take place, where organizations are saying: look, we're going to take what was traditionally the operations domain and split it into two distinct capabilities. One is platform engineering, which looks more like a product or development team, and that's a centralized organization. And then we're going to take the people who operate the systems, oh, sorry, I have the same thing on my slide here, and that's going to be a distributed function. Some people call it SRE, some call it other things, but it's the distributed function that is embedded in all the teams and is doing the actual running of the systems.

00:24:50

That is distributed; platform engineering is centralized. The folks at Disney, who presented here, I think today, have done a lot of work in this area. On the other side of the world, Shaun Norris, who was at JPMC, Standard Chartered, now at Pivotal, was also driving large-scale operations in the financial world toward this way of dividing and conquering. So we've got our people there. We've got the Incident Command System. We're mobilizing, we're moving them toward doing something. I think the view on escalations is interesting. There are two themes here. One is: avoid them at all costs. How do we push control closer to the people who first spot and respond to the problems?

00:25:39

Jody Mulkey, former CTO of Ticketmaster, did a lot of fascinating work there. They took their MTTR from 40-something minutes for major events down to about four minutes. That's something like a 90 percent reduction, and it's all from pushing control closest to the problem, rather than having to escalate up through the organization. And Jon Hall again comes back in here. I think he's done a really interesting job of bringing in this idea of swarming, which came out of the traditional call center, service management world, saying: instead of having these escalation trees, build organizations that have the capability to swarm, so you make sure that you're getting the problem to the right person as soon as possible, dramatically cutting down on these escalation chains. And then along comes the return of runbooks. That's something I know a lot about: being able to take action, diagnose things, and restore service.

00:26:24

Runbooks kind of disappeared for a while. They were big in the enterprise, and then the configuration management world told us it was going to be NoOps and we didn't need runbooks. But now, thanks to the SRE movement, runbooks are back, and specifically runbook automation. It's something that my colleague Alex Honor and I focus a lot on, which is: how do you give safe, self-service access to the expert knowledge that you need to take action? So, the knowledge part. It's easy to move the bits, right? You've got the scripts, you've got the APIs, you've got the command line. But how do you take that knowledge out of the subject matter experts, formalize it, and distribute it in the organization so those closest to the problem can use it?

00:27:06

It's got to be self-service. Again, we have to empower those closest to the problem. We've got to get rid of those escalations. And it's got to be safe. Not only safe in the sense of mistake-proofing it as much as we can, with smart choices and guardrails so people don't create problems when we hand it off, but also safe from the auditor and compliance perspective. Because before runbook automation, no matter how good we were at that first half of the cycle, what happens? One of three things. Either we're trying to decipher the wiki: is this the thing? When was this written? What was this person trying to say? Or we're doing this ad hoc tool and script usage: what did they tell me on Tuesday?

00:27:45

It's not dash I, it's dash i. Or is this even the right version of the script? But most likely what we're doing is escalating on up. Someone else is going to be able to solve this problem. But with runbook automation, we're empowering the people closest to the signals to actually go ahead and take action, by taking that knowledge and turning it into automation that can coordinate all of the incantations of the scripts and the tools and the APIs that we need. And one illustration is to talk about the old way. This is kind of cartoonish, obviously, but say level one sees there's a problem. It's a problem with that service, so let's call the SRE or the on-call for that service.

00:28:23

And they go, oh, this is a problem with the application. They call the developer on call. The developer goes, ah, there's no data, it must be a database problem. They call the DBA. The DBA shows up and goes, oh, this is a network problem. So we're spending all this time escalating up, and it's probably a larger blast radius than all of that. And also, while that's happening, as I said before, we're injecting all of this delay into our organization and taking people off of the other work they should be doing. With runbook automation, it's: how do we empower that level one with all of the knowledge of those different subject matter experts to go and first diagnose that problem?

00:29:05

And if they can't solve it right there, they know who to escalate to. Better yet, in the enterprise, problems happen over and over and over again. So we can give them: hey, here are the checks to see if this is a problem that has happened before. Run all these checks, and for the one that you see, here's the action you can take to repair it. We see people talking about an 80 to 90 percent reduction in the time it takes to resolve these known incidents that happen over and over again. And we're also stopping those interruptions, all those times that subject matter experts are being interrupted. We're putting two painful things in the organization. One is interruptions and the other is waiting. You ask, and now you delay and wait for somebody. And you're being interrupted all day long, and then it comes time to go do something, and what happens?

00:29:44

You're now waiting in somebody else's queue for somebody else to do something. So another area where runbook automation is used very successfully is providing people with self-service, so you don't have to constantly be in that chain. And now that we're in the DevOps world, we've got cross-functional teams, we've got to let developers do restarts in production, we've got build-and-run teams: you build it, you run it. How are we actually going to do that? With runbook automation, instead of just saying, hey, here's an SSH key and some sudo privileges and a shell script, say a prayer and have a good time, we can say, hey, let's give them named access to these particular procedures.

00:30:23

And that's how we're going to get around these security and compliance issues, because we're able to run it through an SDLC. We're able to do a code review, operations and security say, yes, this is good, and then we use the access control to turn it around and let somebody else do it. I don't know if you saw this earlier this week, but Bob GoodCo from Capital One was talking about the custom system that they built, which does a very similar thing. The whole idea of runbooks as a service is to take all this and centralize it, because they want to do two things. One is they want to rapidly be able to know: is this a known problem? Let's try the known fix. And if it's not, let's figure out as fast as possible, through the automated system, how to call the right diagnostics.
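
The flow described here (is this a known problem, try the known fix, otherwise run diagnostics to find out who to escalate to) can be sketched roughly as follows. This is only an illustrative sketch, not Rundeck's or Capital One's implementation: the names KnownProblem, Diagnostic, and triage, the signal fields, and the team names are all hypothetical, and the "fix" is a placeholder that only returns a summary string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KnownProblem:
    """Hypothetical runbook entry: a check that recognizes a problem plus a vetted fix."""
    name: str
    check: Callable[[dict], bool]  # True if the signals match this known problem
    fix: Callable[[dict], str]     # pre-approved remediation; returns a short summary


@dataclass
class Diagnostic:
    """Hypothetical diagnostic: when the probe fires, it points at a team to escalate to."""
    team: str
    probe: Callable[[dict], bool]


def triage(signals, known_problems, diagnostics, default_team="incident-commander"):
    # Step 1: is this a known problem? If so, try the known fix.
    for problem in known_problems:
        if problem.check(signals):
            return {"action": "fixed", "runbook": problem.name,
                    "detail": problem.fix(signals)}
    # Step 2: not a known problem, so run diagnostics to find the right team.
    for diag in diagnostics:
        if diag.probe(signals):
            return {"action": "escalate", "team": diag.team}
    # Step 3: nothing matched; hand off to a default owner (an assumption here).
    return {"action": "escalate", "team": default_team}


# Example data, invented purely for illustration.
KNOWN = [
    KnownProblem(
        name="disk-full",
        check=lambda s: s.get("disk_used_pct", 0) >= 95,
        fix=lambda s: "rotated logs on " + s["host"],  # placeholder, does no real work
    ),
]
DIAGS = [
    Diagnostic(team="dba-on-call", probe=lambda s: s.get("db_errors", 0) > 0),
    Diagnostic(team="network-on-call", probe=lambda s: s.get("packet_loss_pct", 0) > 1),
]

if __name__ == "__main__":
    print(triage({"host": "web-1", "disk_used_pct": 97}, KNOWN, DIAGS))
    print(triage({"host": "web-1", "db_errors": 3}, KNOWN, DIAGS))
```

In this sketch the first responder only ever calls triage; the checks, fixes, and probes are written and reviewed ahead of time by the subject matter experts, which is the point being made about moving knowledge closer to the problem.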

00:31:03

So we know who to escalate it to. And then, wrapping up here, we've got the learn part. Usually the circle, the OODA loop, is the top half there: observe and react, observe and react. But after the incident comes the learning. Again, you should look at a lot of the things that John Allspaw has been talking about. He talks about how the problem in many enterprises is that we think the value is the action items. So someone will send an email ahead of time, maybe we'll begrudgingly get together to talk about the post-mortem, maybe we'll do a report, and all the execs want to know is: what are the action items? And the reality is there's very little value in that.

00:31:45

You're probably going to actually create more problems that will cause more outages in the future, and not really get at the real problems. The real value is in the journey along the way: the focus on the learning, the storytelling, understanding all the contributing factors. Change people's minds: it's not about the outcome of the action items, it's about that collective understanding, so we have a better understanding of how our above-the-line action helps the below-the-line action. And again, why? Because these incidents are unplanned investments, and the ROI is up to us. The money's already being spent, the money is already being blown. How can we make that money less, but also how can we get the most value out of it for the organization? It's an investment in the survivability and the future of your organization.

00:32:27

So to recap, don't forget all the things that we're stewing in. A lot of people jump to saying, we're just going to fix this, just change how we do incident management. The reality is, if we don't bring people through the path of how we got to where we are and what the contributing factors are in our industry, it's going to be hard to jump to the answers. And then look to what a lot of these organizations are doing across the different parts of this cycle. So my name's Damon Edwards. That's my talk. Again, the slides are there. You can hit me up on Twitter anytime you want, or just email me directly. We'll be at the Rundeck booth the rest of the day over in the exhibit hall if you want to talk about any of these things. And enjoy lunch. Thank you.