Las Vegas 2020

Runbook Automation: Old News or a Key to Unlock Performance?

Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the popular open source Self-Service Operations platform.


Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent the past years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps and SRE techniques to large enterprise organizations.


Damon is also a frequent conference speaker and writer who focuses on DevOps, SRE, and IT operations improvement topics. Damon is active in the international DevOps community, including being a co-host of the DevOps Cafe podcast, an early core organizer of the DevOps Days conference series, and a content chair for Gene Kim’s DevOps Enterprise Summit.

Damon Edwards

Co-Founder and Chief Product Officer, Rundeck

Transcript

00:00:07

Many people ask me, "Gene, how exactly did you get into the DevOps movement?" And the answer I usually give is, well, it was just a natural continuation of my 21-year journey studying high-performing technology organizations, which drew me to the center of the DevOps movement, which I think is urgent and important. But if you kept asking me, "Well, how exactly did you stumble into the DevOps movement?", eventually you would hear me say that in 2010, when I was at Tripwire, I got this email out of the blue from someone I had never heard of, inviting me to be on a panel at an event I'd never heard of. So I went to it, and I was blown away by what I saw. And there I met this amazing group of mavericks who were at the epicenter of the DevOps movement, including John Allspaw, who you heard from yesterday, John Willis, Patrick Debois, Andrew Shafer.

00:00:51

And so many more. That event was DevOpsDays 2010, the first DevOpsDays in the US, a series created by Patrick Debois. The person who emailed me was Damon Edwards, who was one of the conference organizers along with John Willis. So over the years, Damon has been one of my favorite people to collaborate with. He's been a part of the program committee from the very beginning, and he helps shape all the next-generation ops and infrastructure programming talks. Damon will talk about the gap that still remains for operations, despite SRE, despite platforms, deployment automation, and the technology du jour. Damon is a co-founder of Rundeck, and so congratulations to Damon and team for the recent acquisition by PagerDuty. Here's Damon.

00:01:36

Thanks, Gene. I appreciate the kind words. It's great to be here today, even in this virtual Las Vegas and these unprecedented times. It's going to be great next year to be able to see all of you again in person. As Gene said, my name is Damon Edwards. You might know me from my work at Rundeck, where I was one of the co-founders, but as of a couple of weeks ago, big news: we're now part of PagerDuty. Rundeck has been acquired by PagerDuty, and it's great to be joining forces. But enough about me, let's get started. So who knows what this is? B, A... excuse me: up, up, down, down, left, right, left, right, B, A, start. Right? I think if you played video games in the late eighties or early nineties, your thumbs might've been twitching.

00:02:18

When I said that, I was of course reciting the famous Konami cheat code, right? In a whole lot of games, this unlocked extra lives or extra power-ups, or slowed things down, or sped things up. A variety of things happened depending on the system you were playing, but it would give you an advantage, right? That's what these cheat codes are: they're unlocks. They unlock capabilities. They make it easier to overcome the obstacles in the system you're playing, moving you towards whatever the goal or the quest is for that video game. So you might be asking, well, what's this got to do with a DevOps conference? And I started thinking recently, as our world has changed around us, what is it about these events? What brings us here?

00:02:58

And of course there's the fellowship and the camaraderie and often the commiseration, right? But I think there's more to it. At the base of it, I believe it's about these cheat codes. It's about these unlocks: the design patterns, the principles, the techniques, so we can learn from each other and then go back and apply them to the systems in which we work, to help ourselves and our colleagues and our companies overcome the obstacles, make the obstacles a little lower and the barriers a little further away, and improve our overall performance. So I started thinking about these cheat codes and started to think about, well, what's the next unlock? What are the next areas that we can really focus on to unlock the most value?

00:03:43

And for me, the thesis of my talk is that the next great unlocks are going to come from operations. If you think about operations activity, not just what happens within the four walls of the operations organization, but operations activity wherever it may lie, there are really two main parts to it that are distinctly operations. One is incident management: spotting and resolving problems. The other is service requests: how do we take the business requests, or requests from our colleagues, and handle them as quickly as possible? Now, I want to focus in on incident management, because I think that's where the rubber really meets the road in terms of the capabilities of an operations organization: their ability to spot and resolve problems as quickly as possible.

00:04:31

And what do we all want out of our incident management? I think this hasn't really changed for decades. We want shorter incidents. We want to be able to solve these problems for our customers or for our users as fast as possible, but we also want to do it with as little disruption to the rest of the organization as possible. So shorter incidents and fewer escalations are what we're after. But what has always gotten in the way? I think we've learned now that what really gets in the way is complexity. And I don't mean complicated, I want to be clear about that. A car engine is complicated. Complexity is the randomness and the unpredictability of something like, say, traffic in a city. As our systems have become more complex, we've come to understand that it's really dealing with that complexity that has gotten in the way, and that all along has prevented us from having the shorter incidents and the fewer escalations that we've always wanted.

00:05:25

And it was really J. Paul Reed and John Allspaw, two folks in our industry, who have done a great job of bringing in the broader world of the safety sciences, of really understanding complex systems and why problems happen and often don't happen, and translating it into our domain. I've been lucky to work with them for the past few years in the community sense, and they've really beaten into my head this idea that our world is complex and it's not deterministic. And often it's that difference in viewpoint that matters: is what we do a deterministic, predictable thing, or are we living in the randomness and unpredictability of complex systems?

00:06:06

It's those two beliefs, those two viewpoints, that are often at the root of the conflict, of these DevOps conflicts that we've all been experiencing and have been gathering to try to solve. Just to lay out a little more clearly what I'm talking about, think about the world from the point of view of the development side of the house. You're kind of trained to think from a much more deterministic point of view. You write some code, and it either builds or it doesn't. It either runs or it doesn't. It's a very binary activity. There are inputs, there are outputs. If something goes wrong along the way, you can put your finger on exactly what it was.

00:06:45

As we start to think in more distributed systems, we start to carry over that same deterministic point of view: oh, it's just a broader collection of these deterministic pieces. We can predict the inputs and the outputs, we can version these things and we know what version we're on, and we can move in a very orderly and predictable way. Now, if you come at this problem from the operations side, things don't look the same. In fact, this is a great tool that came out of Cornell. It's a visualization tool for microservices architectures. This is a modern, mid-sized public SaaS. If you look around the outside, that tiny gray text is the different instances of the microservices in this service.

00:07:33

And the blue is all the communication over time between those various services. Aside from looking suspiciously like a 1990s data center wiring closet, it brings up this other idea that, well, the world's not so orderly. Whereas on the left we're thinking about things much more as a technical system, on the right we're realizing this is actually much more of a socio-technical system, because there's all this activity and all this uncertainty that we can't control going on around that "death star" diagram, as they call it. Network traffic changes, API performance might change, libraries are updated, configurations are changed, and with different cloud providers, hardware variation might be introduced.

00:08:26

And of course, all along, we're constantly tailoring the system, either for performance reasons or for business fit. It's never one size fits all. We're always slightly changing and tweaking these systems, and all these things are being tweaked around us. Dr. Richard Cook is, I think, one of the giants of the safety sciences world. He has been bringing a lot of what's been learned from medical disasters, transportation problems, and industrial accidents, the broader safety sciences, into our domain as of late. I love that he calls what's on the left the "system as imagined". This is what we're holding in our heads, where our brains want to drive towards this predictable, deterministic point of view. That's versus the "system as found", where we're really dealing with complex, much more random, much more unpredictable systems.

00:09:22

We can reason about them. We can make assumptions about them. But we can never perfectly predict what's happening. And Dr. Cook likes to point out the question: what's the role of the human in these systems? Because it's very unique that the human plays two roles here. The humans are the ones who fix the system, and the humans are also the ones who cause the problems in the system. So this is an interesting duality that we have to deal with. So what are humans doing day in, day out, if you watch them working in these systems as found? There are four pieces. One is monitoring: looking for signals. The second is responding, as in mobilizing to make sense of what you're seeing from those signals.

00:10:03

And then there's adapting: more tailoring of that system to try to adapt the behavior to be what you actually want it to be. And then of course there's learning: the feedback loops and understanding of what just happened. And think about it: automation alone can't do two, three, and four. That's really the domain where humans are at their best. Automation can help us there. But think about being able to coordinate, being able to redirect our attention, being able to use creativity, being able to be surprised. Being able to say a simple thing like, "Huh, I didn't expect you to do that. Can you tell me what you were seeing that caused you to do that?" That simple kind of question: machines are terrible at that. Humans are great at that sort of synthesis and that sort of creative thinking.

00:10:46

So automation alone can't really help us there. And in fact, if you look at the research across all these other high-consequence domains, places where they've spent billions of dollars and thousands, if not more, person-years' worth of research, whether in the medical fields, nuclear power plants, or transportation safety, they all come to the same conclusion: the role of automation is best when it serves to support the human operator, not to replace the human operator. In places where they've tried to replace the human operator, it hasn't gone so well. In places where they've built the automation systems to support the human brain, to support the human operator, they get a lot better results, right?

00:11:31

So it should hold that it's going to be the same for our field as well. I'd like to talk a bit more here about Dr. Cook, because he calls us out to say, hey, we have to learn to trust in our operators. Too much of our design historically has gone into preventing people from doing things. It's all about building the ability to stop you from doing things, to take things away, to make things more of a black box. And we haven't gotten the results that we wanted. If we're going to learn the lessons from these other domains, it's that we need to reveal the actual controls that are available. We need to find the right levers and knobs and make them available to the humans, so the human brain can do what the human brain is best at.

00:12:13

And the way I would describe this is that it's like finding the right abstraction layer. If we go too high, if we get to that black box level, well, bad things happen there. In fact, there's a great paper called "Ironies of Automation". It came from 1982, by Dr. Bainbridge, and it's a great discussion of how the more automation you add, the more you get these unintended consequences, and how bad things can start to happen the more you tend towards that black box automation. It's a fantastic paper, along with Dr. Cook's "How Complex Systems Fail". Some of these great things were written a few decades ago. You'd think they're talking about the distributed digital business systems that we're running today, but they're not. They're talking about a lot of other domains, but they're just as relevant.

00:13:04

So then the other side of the abstraction is that we go too low. I call this the "SSH, sudo, a bag of scripts, and say a prayer" approach. We all know the randomness and the variability and the problems that we have there as well. So that's the abstraction layer going too low. It's a delicate dance to find that right abstraction layer. But what actually ends up happening, and what I see in most companies, is that they punt on the whole issue of how to build that abstraction, and instead their experts become the abstraction. What I mean by that is, there's somebody here like Alice. She's one of our key individual contributors, and she knows all the scripts and tools and commands.

00:13:40

But she also knows how to target them. She knows what order to run them in. She knows what options to provide and when to provide them. She also knows how to interpret what comes back, to say, hey, is this good or is this bad? So we hold up these experts, and they become our abstraction. And of course, now they become the bottleneck. They become the silo. Now everybody has to push their way through that expert in order to get anything done in these environments, and we have all kinds of repercussions that come from that. And of course, what's the first step we take? We say, well, let's maybe add some more of these experts, right?

00:14:17

So now we have to take the time to try to train up these extra master craftsmen. And now we have these extra coordination issues inside that expert silo, and we haven't solved the pressure problem from outside. And we see a lot of the common problems that we see day in, day out in enterprise operations. So the answer here is to apply self-service. And I don't mean self-service just in the sense of, how do you make it so someone can run a script. I mean, how do you take all that knowledge from those experts' heads, about how you invoke the right thing at the right time with the right options, with the right guardrails around it to make sure you don't pick the wrong things, with the error handling, with the notifications, with the ability to really do everything that the expert would do to invoke those underlying tools and scripts? How do you abstract that into a self-service layer?
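As an illustration of the self-service layer described above, here is a minimal Python sketch. It is not Rundeck's API or any product's implementation; the `SelfServiceJob` class, its fields, and the target names are all hypothetical. It only shows the idea of wrapping an unchanged underlying command with the expert's knowledge: which targets are allowed, how errors are handled, how the result is interpreted, and who gets notified.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class SelfServiceJob:
    """A vetted wrapper around an existing command (all names are hypothetical)."""
    name: str
    command: list[str]            # the underlying tool, left unchanged
    allowed_targets: set[str]     # guardrail: where this job may run
    timeout_s: int = 30
    log: list[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        # Stand-in for a real notification channel (chat, pager, email).
        self.log.append(message)

    def run(self, target: str, requested_by: str) -> bool:
        # Guardrail: refuse targets the expert never approved.
        if target not in self.allowed_targets:
            self.notify(f"{self.name}: refused target {target!r} for {requested_by}")
            return False
        try:
            result = subprocess.run(
                self.command + [target],
                capture_output=True, text=True, timeout=self.timeout_s,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Error handling the expert would otherwise do by hand.
            self.notify(f"{self.name}: could not run on {target!r}: {exc}")
            return False
        # Interpretation: encode what "good" looks like (here, exit code 0).
        ok = result.returncode == 0
        status = "ok" if ok else "FAILED"
        self.notify(f"{self.name}: {status} on {target!r} for {requested_by}")
        return ok


if __name__ == "__main__":
    job = SelfServiceJob(
        name="echo-check",
        # Placeholder command; a real job would call an existing script or tool.
        command=[sys.executable, "-c", "import sys; print('checked', sys.argv[1])"],
        allowed_targets={"web01", "web02"},
    )
    job.run("web01", requested_by="alice")
    job.run("db01", requested_by="alice")  # blocked by the guardrail
    print("\n".join(job.log))
```

The design choice worth noting is that the wrapper does not hide or replace the underlying command. It only records how and where the command may be invoked, so someone outside the expert silo can run it safely.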

00:15:17

And the idea here is that it's not just to make the experts work faster, but to be able to safely give that self-service to other folks outside of that expert silo, whether they're within the traditional operations bounds or in the new, broader "you build it, you run it" world. And the key thing is, we're not changing the underlying tools. We're not trying to obfuscate or hide those things. We're just trying to capture the knowledge of how to invoke and use those things in a safe and repeatable way. So let me give you an incident management example. Again, here's Alice. As before, Alice is one of our senior individual contributors. She knows a lot of the ins and outs, but she just can't know everything.

00:15:58

So when an alert comes in, what are Alice's options? One is, hey, I can look in the wiki. I can look for notes. I can try to find the right way to manage these other parts of the system. But then you wonder: do I have the right information here? Is it from the right person? Is this even valid anymore? What are they trying to say? Or I might have shared scripts or shared tools, but again, I'm in the same problem. Do I have the right version? Hey, the network folks said don't ever use dash-I... oh wait, no, don't ever use dash-i. You're fraught with peril there as well. So what ends up happening? What does Alice do?

00:16:33

It's the old escalation, right? How do I pull in as many people as possible from other parts of the organization who might know these different pieces? And along the way, our incidents are obviously taking longer, and we're pushing those disruptive escalations throughout the organization. Compare that with the self-service approach, where we have this layer of automated self-service. What Alice is able to do is basically react with the same efforts that her other expert colleagues would. So it's saying, well, how would the network team diagnose this? How would the database team diagnose this? What would the platform team do? And she's able to run all those herself. And then maybe she also has remediation steps beyond that. How do I restart this? Or how do I clear the cache? Or how do I reset this?

00:17:17

Or how do I roll back to a known good state? Or how do I fail over to a failover system? All of that can now be put in Alice's hands, and she can act with the same expertise, not all of the expertise, but a lot of the same expertise that her expert colleagues have. And there are two patterns that you see people applying here. One is sort of the Iron Man suit, or iron person suit, which is: how can we augment the human with as much ability as possible to diagnose and resolve these problems? The other is more of the robot approach, which is to say: how can we pre-process a lot of these alerts and run these diagnostics before Alice has time to log in? Or, hey, we know there's a certain issue and it's going to happen again and again, because the dev team doesn't have budget to fix it yet.

00:18:04

So we want to set up more of these automated processes to automatically call that self-service from the alert system, to try to keep these problems off of Alice. And of course, even better, it would be great if Alice could now take the self-service and hand it off to somebody else. Now she can focus on her other work, and she can distribute that operational burden throughout the organization, without constantly being interrupted. And this works great for service requests as well. Think about Alice in her day job, trying to get things done, and all the requests that come in. People need things, or want her to do something, or need her to answer a question. All of that goes into a ticket queue, right?
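The "robot" pattern described above, where the alert system calls the self-service layer before a human logs in, can be sketched roughly as follows. This is an assumption-laden illustration: the alert names, the diagnostic functions, and the `handle_alert` entry point are all hypothetical, and the step functions return placeholder strings instead of calling real tools.

```python
from typing import Callable, Optional

# Hypothetical diagnostic and remediation steps. In practice each one
# would invoke an existing, vetted script or tool.
def check_disk(service: str) -> str:
    return f"disk check ran for {service}"

def check_connections(service: str) -> str:
    return f"connection check ran for {service}"

def restart_worker(service: str) -> str:
    return f"restarted worker for {service}"

# Which vetted diagnostics to attach to which alert type (assumed names).
DIAGNOSTICS: dict[str, list[Callable[[str], str]]] = {
    "high_latency": [check_connections, check_disk],
    "queue_stuck": [check_disk],
}

# Known, recurring problems with an agreed automatic remediation.
AUTO_REMEDIATE: dict[str, Callable[[str], str]] = {
    "queue_stuck": restart_worker,
}

def handle_alert(alert_type: str, service: str) -> dict:
    """Pre-process an alert so findings are ready when a human looks at it."""
    findings = [step(service) for step in DIAGNOSTICS.get(alert_type, [])]
    remediation: Optional[Callable[[str], str]] = AUTO_REMEDIATE.get(alert_type)
    action = remediation(service) if remediation else None
    return {
        "alert": alert_type,
        "service": service,
        "findings": findings,
        "action_taken": action,
        # Anything without an agreed remediation still goes to a person.
        "needs_human": action is None,
    }

if __name__ == "__main__":
    print(handle_alert("queue_stuck", "billing"))
    print(handle_alert("high_latency", "web"))
```

In this sketch the human stays in the loop by default: only alert types explicitly listed in `AUTO_REMEDIATE` are handled without a person, which matches the talk's point that automation should support the operator and not replace them.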

00:18:47

Which means you've got a little bit of waiting that those people have to deal with, or maybe a lot of waiting. And all these interruptions are being pushed towards Alice, which is keeping her from doing the other high-value work that we want her to be doing. Instead, by setting up the self-service, Alice can get these other folks to be able to help themselves. They need to set something up, they need to change something, they need to run a report or performance checks, whatever it might be. She can take those repetitive requests, turn them into self-service, and let other people safely run them. Alice can focus on her other work. And this also helps enable new organizational models, right?

00:19:28

So "you build it, you run it" is a new idea that has a lot of promise. It's one of these unlocks, one of these cheat codes. But the problem is that in a lot of big organizations, especially a lot of highly regulated, secure organizations, there are all sorts of perils in saying, hey, we're just going to let people have access to these production environments. But by using that self-service, we can say: operations team, security team, compliance team, let's vet this code, let's vet these procedures that we want to run through this self-service. And then we can turn around and say, this looks good, and let really anybody run these, and they don't have direct access to the system. Everybody's much happier. Compliance is much happier.

00:20:08

So with this self-service ability, in addition to relieving these headaches all around the organization, we can enable these new organizational models. And so, what's the magnitude of impact that we're talking about here? Because in any organization you can say, hey, this is going to be great, it's going to cut down on the headaches. But when we start to get into the finance side of the house, they're like, well, we pay you to have those headaches, so what's in it for us? And of course, this is just looking at what's possible with putting self-service into place. Your mileage may vary. But when someone comes knocking and says, well, what is this going to get us?

00:20:45

Well, let's think about how much shorter the incidents can be. We see folks talking about 30, 40, 50, 60% shorter incidents. And this is not an MTTR calculation. It's more of an anecdotal learning. You look at past events and say, hey, before we had the self-service, how long would all these escalations take us? How difficult would it be to get the right people in the right place to make the right decision and to do the right thing? Versus in this new self-service model, where we're pushing control closer to the people at the edge, taking out all of that escalation, taking out all that waiting, and being able to diagnose and solve problems quicker and by a broader audience, how much faster can we get things done?

00:21:28

Likewise, one of the more interesting characteristics I've seen in organizations that have gone this way is the number of escalations. They're talking about cutting escalations in half. How many of these repetitive problems that happen time and time again can you solve by pushing control closer to the edge, versus having these escalation chains that are constantly interrupting other folks in the organization? And then when it comes to things like service requests, I mean, what's instant gratification? Let's say 99% faster turnaround time, and it could be even faster, versus the old way: fill out a ticket, interrupt somebody, wait for them to do something, do a couple of rounds of communication, and then finally get it done. Instead, with the self-service, someone can do it themselves. That's a huge gain there.

00:22:11

So if you add all that up and look at what goes on in your organization, some people say it's up to 15 or 20% of the total organizational time spent in this operational area that you can save, just by applying that self-service and cutting out all of this waste. You've got to do the calculations for your own organization, but it can really be a massive unlock there. So how is this self-service created? I think one of the key design patterns is the notion of runbook automation, meaning that you take those procedures, take that knowledge, create these automated workflows, and build the right abstraction layers.
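For the "do the calculations for your own organization" step mentioned above, a back-of-the-envelope estimate might look like the sketch below. The function name and every input value are placeholders invented for illustration, not figures from the talk; only the shape of the reductions (shorter incidents, fewer escalations) follows what was described.

```python
def monthly_hours_saved(
    incidents: int,
    avg_incident_hours: float,
    incidents_shorter_by: float,    # fraction, e.g. 0.3 means 30% shorter
    escalations: int,
    hours_per_escalation: float,
    escalations_fewer_by: float,    # fraction, e.g. 0.5 means cut in half
) -> float:
    """Rough estimate of people-hours recovered per month (illustrative only)."""
    incident_hours = incidents * avg_incident_hours * incidents_shorter_by
    escalation_hours = escalations * hours_per_escalation * escalations_fewer_by
    return incident_hours + escalation_hours


if __name__ == "__main__":
    # Placeholder inputs: substitute numbers from your own incident history.
    saved = monthly_hours_saved(
        incidents=20,
        avg_incident_hours=2.0,
        incidents_shorter_by=0.3,
        escalations=40,
        hours_per_escalation=1.5,
        escalations_fewer_by=0.5,
    )
    print(f"Estimated hours saved per month: {saved:.1f}")
```

This deliberately ignores service-request wait time and second-order effects such as fewer interruptions, so it should be read as a rough starting point for the conversation with finance and not a complete model.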

00:22:56

Those layers still show visibility down to the underlying tools and scripts, but they also contain enough of that knowledge and enough of those guardrails to guide people in the right direction. And if you think about it, that really is only half the problem. That's the "do" side of the problem: how do you take action? The other side is the "view" side of the problem: how do we augment the human with the right information, the right knowledge, to be able to take action? And after some fits and starts, I think what we've seen lately coming out of the AIOps and observability space really is the next unlock there. It helps provide that view. So we're not only giving people the ability to take action, but also all the right information to be able to take the right action, or at least to learn from their systems and be able to reason as to what the next step should be.

00:23:47

So that's my thesis. That's my talk. I think these are the next great unlocks: self-service operations, runbook automation, AIOps, and observability. We'll have to talk more about this. If you think I was convincing, or you think I wasn't convincing, you can let me know at damon@pagerduty.com, that's new, or you can tweet at me if that's your thing. And our Rundeck colleagues have got a booth in the virtual expo, so stop by and say hi. You can grab the slides at rundeck.com/does20. They've got links to these different papers and talks that I think might help fill in some of these details for all of you. So thanks for having me. I appreciate you all being here, and again, I hope to see you next year in person. Thank you.