Las Vegas 2018

Operations - The Last Mile

Why is it that some DevOps transformations stall while others continue to flourish?


This talk will make the case that Operations is the most predictable differentiator. So much of the energy of the DevOps movement has gone into activities that start in Dev and move towards Ops: agile practices, automated deployment pipelines, automated testing, and of course, the unofficial mantra of "deploy, deploy, deploy."


However, when it comes to Operations, too many DevOps transformations have stuck with the status quo and left problematic Operations practices in place. By not fully engaging with and transforming Operations, companies are preventing themselves from realizing the full potential of their DevOps investments. This gap is the last mile problem of DevOps.


This talk will first examine the trouble with the various siloed, ticket-driven, low trust, and centralized practices that have been accepted as status quo in Operations for far too long. Then we will look at the specific techniques being used by high-performing Operations organizations who are fundamentally transforming how they operate.


Damon Edwards is a Co-Founder and Chief Product Officer of Rundeck, Inc., the makers of Rundeck, the popular open source operations management platform. Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy.


Damon has spent over 15 years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps and SRE techniques to large enterprise organizations. Damon is also a frequent conference speaker and writer who focuses on DevOps and IT operations improvement topics. Damon is active in the international DevOps community, including being a co-host of the DevOps Cafe podcast and an early core organizer of the DevOps Days conference series.

Damon Edwards

Co-founder and Chief Product Officer, Rundeck

Transcript

00:00:02

How many folks think that the things we're talking about here at these conferences, uh, have the opportunity, the potential to improve not only the lives of the people who work in technology organizations, but also the bottom line of those technology organizations? Gimme like a yes. Yeah. Yeah, me too. And that's why I'm here to tell you about what's gonna get in the way. So, um, yeah, the point of this talk, uh, just about the title, the Last Mile: if you aren't familiar with that term, it comes out of the telco industry. The idea is you build this network of value, all these things you've done, but there's gonna be that last mile to connect to the user, to connect to the customer, so we can realize the full potential and the full value of all the work that we're doing. And, uh, my thesis is that that last mile is operations.

00:00:42

Um, so let's get going there. So, developers have had this unfair advantage over operations. If you think about it, uh, for the last 17 or so years, this thing called Agile has been seeping into their brains, right? Now, maybe they haven't been doing Agile, but if you think about it, it's in the textbooks. It's in the books, it's in the conference speeches, it's in the language. It's in the tools. And these ideas of lean and flow and fast feedback, working in small batches, have been steeping into, uh, developers' brains for a long time. Think about operations. What is the last sort of intellectual movement that has really swept through the entire industry? ITIL, 1989, right? And so, if you think about it, when I talk about these DevOps concepts and everything, you know, developers have this unfair advantage.

00:01:25

So now, here we are in 2018, right? And ops is in this really tough position. It's under pressure. On one side, it's all the go, go, go, you know, digital DevOps transformation, um, you know, go faster, open things up. But on the other side, often from the same business folks, it's lock it down, right? Don't be the next hack. Don't be the next breach. Keep us out of the news, right? And as, uh, you know, our digital pipeline is becoming more and more our factory floor, this is gonna matter more and more. So ops is being squeezed between these pressures, and it has a hard enough time managing that pressure, let alone finding the time to improve how things work, to join the transformation, to complete that last, uh, that last mile, okay? Story time, just like Willie Nelson with fewer guitars and fewer gold records. Uh, let's go here.

00:02:09

So this is a true story. The names have been changed to protect the, uh, not so innocent. Look around you. It could be somebody next to you, maybe it's your company. Um, but, uh, they were, uh, you know, big on change. Change was sweeping the company, right? They're digital. They had digital transformation. They're talking agile. They're talking DevOps, spinning up a new SRE organization, right? On the technology side, they got cloud, they got Docker, Kubernetes, microservices. It's just, you know, it's great times. Go, go, go. Uh, they gave a lot of speeches about it. Everyone said, that's awesome. I wanna work there. I wanna be a part of that. But nobody was talking about what happened after deployment, right? It was all about getting to deployment. And then when you burst through that mirage of deployment, it's like 2005 all over again, right? Silos, tickets, you know, conflict.

00:02:51

So let's look at, again, the true story of what one of those days looks like. This is an average Tuesday, right? Was it Tuesday? Uh, so nine thirty in the morning, right? Uh, the NOC starts seeing some lights going. Eh, we've seen some intermittent errors this week. Is this the same one? I don't know. Looks a little different. Not sure. Half hour later, a business manager calls, I guess he's running while he's on the phone, uh, and says, uh, you know, customer issue, blah, blah, blah, outage. So, right? The NOC says, let's escalate. Bob is our incident commander. Uh, Bob opens a ticket. Uh, that ticket goes out in a blast to the business manager who was on the phone, and then to all the app-specific SREs, 'cause we have no idea what actually is going on. Uh, so, I'm gonna add a little lean sort of lens, kind of popups, here.

00:03:32

You know, that's an interruption, right? That's a whole lot of context switching already happening right there. So people jump in, they get on the bridge call, they get into their try this, then try that loop. Of course, they don't have access to all the right systems. So we gotta call in someone from the legacy system administrator team that has access to the production environments with the customer data in it. And, uh, you know, they're going round and round. And of course, business managers are resourceful. It's how they got into those positions. They find their way onto the bridge call, as they do. You know, is it fixed? Is it fixed? What's going on? Is it fixed? Is it fixed? So we got all this waiting happening, right? Uh, we got the dog pile, where everyone's just trying things and, kind of, oh, you log in, you run top.

00:04:08

And the first 10 things you see are the other 10 people on the call running top, right? You know, so, disconnected access is getting in our way, right? So finally, right? Ah, it's a problem with the foo service, right? The foo SRE is just like, ah, okay. Well, can you fix it? Right? No, it's a, you know, we're deploying all these new versions. It's a new version. I haven't been told about it yet. I don't really know what's going on. Sorry. Um, so Bob says, all right, let's escalate. Right? Uh, and this is partially done work in the lean sense, right? They've come to a conclusion and they can't go any further. Gotta pass it on to somebody else to do it for them. Another escalation, uh, you know, more waiting. And, uh, so here, meet Karen, right?

00:04:46

Who is our, uh, lead dev on the foo, uh, service. Uh, she sits on the cool floor with the new open floor plan. She's got her tea, her headphones are on. She's in her last, you know, uh, day of the sprint. Life is good in, uh, development. And then somebody knocks on her door, right? Because she's been ignoring the emails, because she's, you know, locked everything out, 'cause she's in the sprint. And he says, Hey, did you see that ticket? Right? You know, so here comes the interruption. And she's like, okay, well, oops, sorry, I'm pressing the wrong button, you know, uh, okay, fine, I'll take a look. I'll context switch onto something else, right? First thing you notice is, I'm gonna need a lot more information here. I don't have these log files. Of course, we don't give Karen access to them, 'cause it's in production.

00:05:24

So she's gonna open up, uh, a ticket saying, help. I need help. But of course, you know, it's disconnected access. But she knows, you know, the HipChat that, um, some of those SRE folks or old sysadmin folks, you know, sit on. So she says, Hey, can someone help me with this ticket? I need some logs. Um, so of course, uh, Lee, being a very handy, uh, person, pops up and says, here's the logs. Now how many think those are the right logs the first time, right? Yeah, exactly. Exactly right. So, no fault to anybody, but they go round and round and round, right? And if you notice here, we got this little thing down at the bottom, right? I call that the context wagon, right? And when you're working this ticket-driven way, you may not even be active on that ticket, but it's there.

00:06:02

It's got your name on it. It's occupying some little piece of the brain. So watch the context wagon as it goes, right? So more waiting and more interruptions and, uh, more context switching. So finally, Karen gets the logs that she thinks she needs and goes, wait, wait a minute. You know, again, remember, they're going Docker. So it's supposed to solve all the problems, right? And, uh, she says, Hey, what's up with these services? Right? Whoever restarted these, uh, used the incorrect environment variables, right? We're gonna need to restart all these services, this whole service pool, uh, with the right variables. Otherwise, we're gonna see more of these cascading problems. So Bob, our incident commander, now it's two o'clock in the afternoon, right? This started at like 10:00 AM. He says, Hey, uh, middleware team, 'cause, uh, it made sense to give the, uh, the container infrastructure to the middleware team.

00:06:44

We need an urgent restart of this entire app pool, uh, with the right environment variables. Uh, so more partially done work, more waiting, uh, more interruptions, more context switching, context wagon getting bigger. Uh, suddenly the phone rings. It's Melissa, the middleware manager. Are you nuts? It's the middle of the day. You know, we can't restart these services. We're gonna need business approval to do this. So Bob goes right to the top, right? Uh, he says, okay, well, I guess you gotta do this extra process. So we're gonna do it. Uh, you know, a little sense of misaligned priorities here. I'm gonna interrupt the SVP of the whole line of business, 'cause that's what we say we need approval for: to restart customer-impacting services in the middle of the day. Um, and, uh, 'cause of scar tissue. Something bad that happened in the past, <laugh>.

00:07:24

So, uh, you know, um, Susan, the SVP, she tries hard. You know, I mean, it's been many years since she's been near the, uh, the non-email end of the keyboard, uh, because she's, you know, dealing with customers and with, you know, other folks and budgets. And she says, well, customer impact. So they get the VPs together, right? The chief of staff gets 'em on the phone, says, is this gonna be a problem? Eh, it's a, you know, it's a service. Microservices should be great, right? So, uh, you know, they're interrupting everyone's life. Uh, more context switching. Do they really have the knowledge of this? I guess they do, right? Um, so, bing, approve the restart, right? So finally, it's five o'clock at night. It comes back to Melissa, the middleware manager, who says, who knows these production services best? Everyone's like, oh, that's Ellen.

00:08:06

Well, where's Ellen? Oh, we just put her on the plane to Europe to help with the, uh, the new launch. It's like, well, who we got? Who knows it next? Oh, Scott. Well, Scott's only been there for a couple months. He's a pretty handy guy. And he is like, oh, man, okay. I guess that's me. So first thing Scott does is, uh, well, everyone's waiting and there's siloed knowledge, right? 'Cause who has this knowledge? And then Scott goes, well, okay, here I go dumpster diving, right? I'm gonna go into the SharePoint, I'm gonna start looking through the wiki and looking for things. And I'm going, okay, all right. I think I got this. I figured this out. This doesn't look too bad. By the way, context wagon getting bigger and bigger, right? And the, uh, the salaries are getting bigger and bigger, the people in the context wagon. Scott starts going, you know, and says, okay, all right.

00:08:45

Start one service after another. Right? It's a manual process here. You know, it says, Hmm, bar service, waiting for Acme service. Hmm. I don't know what those things even are, right? 10 minutes later, Acme startup failed. You know, oh my God, right? You know, what's going on? Why me? Why me? You know, why is it so damn hot in here? You know, everybody's looking at me. It's just like, oh, this is horrible, right? So, emergency escalation. Ah, you know, bar app service timed out. I can't connect to the Acme service. If I go onto the network, I seem to be able to connect to it on my own. I don't know what's going on here, right? So, uh, escalate. The bar SRE comes flying in. That's Linda. She says, Hey, uh, you know, because of our new DevOps program, we have these environment pre-flight checks that fail.

00:09:27

If you don't have all your, uh, dependencies ready to go. Looks like bar can't connect to the, to this Acme service, okay? Update the ticket. So we're gonna update the ticket. Uh, more task switching with both the network SRE team and the, uh, the bar lead dev, uh, 'cause one of them maybe will fix this problem. And the bar lead dev says, ah, no problem. I can comment out that test. Uh, that'll take my CD pipeline. It'll take me through to QA, and then we'll get to the change management folks. And he's like, okay, that's not gonna work. Let's try the network folks. Right? Uh, but the network folks aren't, um, aren't answering, right? Because why? A little interlude here. Uh, the business managers, um, called Melissa and said, what's wrong with your services? And, uh, in an epic bout of finger pointing, she says, it's the network, right?

00:10:10

<laugh>. And then, so the business managers then call the network folks. They're like, what are you doing about this? What are you doing about this? Right? And they say, don't worry, we're working on it. Well, lo, uh, unbeknownst to them, there's actually a network outage somewhere else in the company. So the network team is getting a, uh, signal: do not answer any emails or calls, fix this network problem, because the business managers are yelling at us. Even though they're working on the wrong problem. Luckily, in a, uh, bout of, uh, some heroics, which, uh, sounds good, but actually means that something's wrong with the process, Scott, 'cause he was new, did his, like, you know, good culture, um, cycle and had beers with, uh, Carlos, a network director from the network team, a couple weeks before. He still had his cell phone number, and said, Hey, buddy, you know, it's me.

00:10:50

Remember me? Can you help me, send somebody our way? Right? Uh, so they do. That's Harry. Harry opens the, he says, uh, your traffic's being blocked by the firewall. Take it up with the firewall team. Right? Urgent firewall request, right? Uh, more escalations, more interruptions, more task switching. Freddy, the firewall engineer, shows up and says, can't be a firewall, uh, problem. It's Tuesday. We haven't changed the firewall since Thursday. That's the day we do it. Uh, and Scott's like, no, it must be the firewall, right? So again, siloed knowledge of how things work, context wagon getting bigger. And Freddy says, ah, you're right. There was a rule change last Thursday that would stop bar from talking to Acme, 'cause we were told it didn't have to do that anymore. Um, they said, well, can you change it back? And he said, sure, on Thursday, we'll change it back.

00:11:30

And it's like, are you kidding me? Right? The chief of staff is like, Freddy, we've got customers calling, and, you know, people are livid, right? So all this extra process, misaligned priorities. They finally escalate. Uh, Nicole from NetSec comes in and says, whoa, this is a production change, right? So I need to have, you know, the other CAB folks weigh in on this. And everyone's like, are you crazy? And then finally, someone says the magic words: I'm gonna call Susan, the SVP, right? And next thing you know, ding, firewall rule change is, uh, is approved. Um, uh, we've escalated, more waiting. And, um, so Freddy, Scott, Bob go through their cycles, change the firewall, restart these services. They go, I think it looks pretty good. They go, you think? How do you not know? It's like, well, they don't really trust us to test our own work, right?

00:12:15

So it's partially done work. We gotta escalate to one of the customer engagement managers who has the right tools to check these APIs to see if they're working. Okay, fine. 9:45 at night by now, who are we gonna escalate to, right? Well, it's Varsha. Where's Varsha? True story. It's her birthday. Varsha is out at her birthday, right? Massive life interruption. Varsha, can you come home and, uh, you know, run these tests? She finally does and says, Hey, this looks pretty good. Um, services started okay, everything looks green. You know, woo, we've, uh, finished our, uh, incident. And then the very next day, it's the kicker, right? Wait for it. Susan calls the meeting. Whose fault is this? Why are we so bad at change? What additional approvals and processes are we gonna put in place so this never happens again?

00:12:58

Right? That is the other <laugh>. So a little, does this sound familiar to anybody? Has anybody ever seen anything like this ever? Yes. Maybe. All right. Thank you. Good. The audience shot for that one. So, you know, and then of course it comes up. It's like, well, you know, someone says, Hey, we've done all this work, right? We invested in cloud, agile, DevOps and everything. Why do you think everything takes so long and costs so much? Someone says, you know, we largely ignored operations, right? And, uh, you know, most companies, what they end up doing is they chase these symptoms all over the place, right? And they follow this old conventional wisdom, and some of these are my favorites. We need better tools, right? Last year it was Ansible. Before that it was Chef. Before that it was Puppet. Before that it was CFEngine and, uh, Opsware, and you name it, right?

00:13:39

So, uh, now we have the same problems. We just have a whole lot more tools, right? We need more people. It's just a non-starter. Even if you can find the people, we're not, you know, we need people to do more. We need people to be able to accomplish more. We're not gonna get more people, right? We need more discipline and attention to detail. I love this one, right? It's like telling developers to write fewer bugs, right? It's just the most asinine thing. We need more change reviews and approvals, right? This is the scar tissue on top of scar tissue. Uh, we'll wait and see. What I, oh, nevermind. This is actually a <laugh>. That's a joke from a different presentation. Sorry, <laugh>. Yeah. So what we gotta do is challenge the conventional wisdom about how operations works. 'Cause fundamentally, to get down to the root, oh, sorry, I almost said root cause.

00:14:16

Allspaw would be mad at me for that one. Uh, we gotta change the systemic conditions that operations has been marinated in for all these years. And it really breaks down to these four forces that keep getting in the way, time and time again: low trust, excessive toil, silos and queues. And I will, uh, get into each of those. Let's start with low trust. Um, you know, it's this idea of who has all the context? Where are the decisions made in our organization? You see all these escalations, right? Where it's not the person touching the problem, it's usually kind of the highest paid person in the chain that, you know, makes these, uh, these decisions. And, um, you know, John Allspaw, there he is, actually, uh, at this conference, uh, it was last year, maybe, um, you know, he did this great thing where he said, who thinks this is dangerous?

00:14:59

Right? And I know some people out there kind of puckered up a little bit, like, ooh. Right? And, you know, but what if I told you, you know, this is just a, uh, something we run periodically on a content cache, right? It's not that dangerous, right? Who thinks this is dangerous, right? I was just changing text here, uh, you know, the capitalization. Well, it seems pretty innocuous, right? What if I told you it was a status check for a load balancer, right? Suddenly, it's kind of dangerous, right? And his point was that it's always "it depends." All of our work is context specific, right? So what are we doing? The people who have all the context to actually know what's happening are over here. And the people who are making decisions are actually the ones that are starved of all that context.

00:15:39

And that "it depends," right? And the key part of this low trust issue is also the notion of psychological safety, right? And this is a great definition, which is: psychological safety is a shared belief that the team is safe for interpersonal risk taking, right? Defined as being able to show and employ oneself without fear of negative consequences to self-image, status or career. Or as Sidney Dekker, who also spoke at this conference, says, it's how easily can you tell your boss bad news? And how easily can your boss tell the organization bad news? And why this is important is, if you look at high performing teams, you know, Google cares a lot about this. And they did this big study, and they looked at all the things that could mean high performing teams, all their Googleyness, all their tests, all their PhDs. And it turned out the, uh, the number one predictor of organizational performance was psychological safety, right?

00:16:25

There's a whole academic study they did on it. It's been in the New York Times, all those things. So, you know, trust and safety go hand in hand. And when it's not there, it's undermining what we do. Excessive toil, right? Another definition for you guys to read here. Uh, it's from Vivek Rau from, uh, Google. And I think it's just fantastic, 'cause it puts a name on a problem that we've always felt, which is: toil is the kind of work tied to running a production system that tends to be manual, repetitive, automatable (should have automated it), tactical, devoid of enduring value. That is the key one to me. It's not adding enduring value to the company. And it scales linearly as a service grows, right? If we have a hundred thousand users and we have, you know, 10 people running it, and we suddenly have a hundred million users, you know, do we have to add more people to get to that level?

00:17:04

Or are the same 10 people still running it? If we did have to add people, probably that work they're doing is toil. And the opposite of toil is engineering work, right? Work that adds enduring value, right? So toil lacks enduring value. Engineering work builds, uh, builds enduring value. Toil is rote, repetitive. Engineering work is creative, um, you know, iterative. Tactical versus strategic. Increases with scale versus enables scaling. And toil can be automated, but engineering work requires that human creativity. So why this matters is we wanna be in balance in the organization. We wanna say, Hey, you know, how much toil do we have? And we wanna keep that toil down, 'cause the engineering work that it could possibly crowd out is what we use to, number one, improve the business, and number two, actually save ourselves and create more capacity by reducing that toil. And if we get out of balance, what happens, as we see in a lot of enterprise organizations, is the toil goes to the max, and we don't have any time to improve the business, but worse, we don't have any time to fight back that toil.

00:17:59

So we're stuck in this downward, kind of inevitable spiral, uh, of how we're working. So this is a key concept for creating the capacity for operations to improve itself, right? Silos, a very important one. Um, so silos aren't just a team, right? It's the idea of how you're working. So if you think about when I'm working in a small group, or I'm working in, uh, like a startup, right? All the people together have the same backlog, the same context, the same kind of tooling, the same priorities. But the problem is, in an enterprise, or anywhere, really, nothing lives in isolation, right? So, you know, we're always gonna need something from somebody else, right? And they have their own, uh, backlog, their own priorities, their own information, right? And this creates all these mismatches, right? And when these mismatches start to happen,

00:18:42

we start to turn more inward, to worry about optimizing ourselves, because somebody else is kind of misaligned or, uh, screwing us up. And so that brings us to queues, right? Uh, because how do we cover for these disconnects and mismatches, right, that happen between the silos? Well, we drop in this thing called the ticket queue, right? And we do it quite liberally, uh, 'cause it's so easy. Thank you, ServiceNow. Uh, and, uh, you know, we all know how well this works, right? It's the, you know, I opened a ticket, I'm waiting for something, I gotta get a PM to go and escalate it for me and get something back. And it's not what I needed. And then I gotta fix it. Or maybe I did need it, but don't need it anymore and need something else, right?

00:19:19

We go around and around. And the truth is, if it feels expensive, it's because it is. There's been a lot of research outside of, uh, operations that shows that, um, you know, ticket queues, or queues in general, are expensive. There's this whole queuing theory, this whole area of mathematics about this. And Don Reinertsen wrote a great book, uh, The Principles of, uh, Product Development Flow, that kind of gets into the math and the physics behind a lot of things that we talk about. And he basically, you know, states outright that, you know, the science is in, right? That, uh, you know, queues are an expensive way to manage work: longer cycle time, increased risk, more variability, more overhead (we've gotta manage these things), lower quality and less motivation. Yet we keep dropping these all over our organization and our work.
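One way to make the queuing math concrete is the textbook M/M/1 model (a single worker, random arrivals, random service times). This is a minimal sketch and not something shown in the talk: the function name and the one-ticket-per-hour service rate are assumptions for illustration. It shows how the average time a ticket sits waiting grows as the team behind the queue gets busier.

```python
def mm1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a ticket waits before work starts, for an M/M/1 queue.

    Uses the standard result Wq = rho / (mu - lambda), where
    lambda = arrival_rate, mu = service_rate and rho = lambda / mu.
    Both rates must use the same unit, e.g. tickets per hour.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate > 0")
    if arrival_rate >= service_rate:
        # Work arrives at least as fast as it is completed: the queue grows without bound.
        return float("inf")
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 1.0  # hypothetical: the team closes one ticket per hour on average
    for utilization in (0.5, 0.8, 0.9, 0.95):
        wait = mm1_mean_wait(utilization * service_rate, service_rate)
        print(f"utilization {utilization:.0%}: mean wait about {wait:.1f} hours")
```

Under these assumptions, going from 50% to 90% utilization raises the mean wait from about one hour to about nine hours, even though each ticket still takes the same time to work. Real ticket queues are messier than M/M/1, so treat the numbers as a direction rather than a prediction.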

00:19:59

And also, you know, we've done all this work to build these value streams, to build these cohesive pictures of the system, or mental models of the system we're trying to work on. And when you take that into queues, we're just blowing that apart, right? Everyone's getting little pieces of the picture. And we're obfuscating that value stream that you've worked so hard in these DevOps conversations to build. And tickets are also these snowflake makers, right? Snowflakes is the fundamental notion of, it's technically acceptable. You know, you get a ticket, you drop in and do something, and you take another ticket, drop in and do something else. And there's a whole lot of one-offs. And they're technically correct at the time, but they're brittle, they're unreproducible. And to mix metaphors, it's kind of like just shooting, you know, time bombs in flames across the, uh, the rest of the organization, where, you know, the next person coming along will run into something they didn't expect, or their automation will just be a little bit off.

00:20:46

And, uh, you know, the only thing worse than automation that doesn't work is automation that's just a little bit broken, right? You know, as they say, to err is human; to create a disaster, you need a computer, right? So those are the forces. So what can we do differently, right? Uh, look at the easy ones first. Low trust, easy. Just shift left the ability to take action. I know it sounds easier than it is, there's a lot more to it, but the idea is, how can we empower these people, right? What enablement, in terms of kind of process and trust, and also in terms of tooling, the guardrails, can we give to let the people closest to the context make the decisions? They have the context to make the decision. If we give them the right tools and the right guardrails to do it, let's empower them, uh, to take action.

00:21:23

Uh, excessive toil. Another one that's pretty straightforward. Uh, you know, number one, track the toil levels for each team. Most folks do not do this. It's not time tracking. Just get a good sense, on a monthly or quarterly or whatever basis, of how the team is doing. Set a toil limit for those teams and fund the efforts to reduce it. If a team's toil is more than the limit that you've set (the kind of industry benchmark right now is 50%), then swarm to it and invest in figuring out how we can reduce that toil, because we can free up the human capital to actually do something of value for the business. And, um, this really came outta the SRE movement. Um, there's a lot of great things about error budgets, service level objectives. Um, Google wrote a great book on it.
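The toil tracking described above can be sketched in a few lines. This is only an illustration under stated assumptions: the survey format (each team member gives a rough fraction of their time spent on toil), the team names, and the function names are all hypothetical. The 50% limit is the benchmark mentioned in the talk.

```python
from statistics import mean

TOIL_LIMIT = 0.50  # the 50% benchmark mentioned in the talk


def team_toil_level(estimates):
    """Average the team members' rough estimates (0.0 to 1.0) of time spent on toil."""
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(not 0.0 <= e <= 1.0 for e in estimates):
        raise ValueError("estimates must be fractions between 0.0 and 1.0")
    return mean(estimates)


def teams_over_limit(survey, limit=TOIL_LIMIT):
    """Return the teams whose toil level exceeds the limit, highest toil first."""
    levels = {team: team_toil_level(est) for team, est in survey.items()}
    over = [team for team, level in levels.items() if level > limit]
    return sorted(over, key=lambda team: levels[team], reverse=True)


# Hypothetical quarterly survey: each number is one person's rough estimate.
quarterly_survey = {
    "middleware": [0.70, 0.80, 0.75],
    "network": [0.40, 0.35],
    "firewall": [0.60, 0.55],
}

if __name__ == "__main__":
    # Teams listed here are the ones to swarm on and fund toil reduction for.
    print(teams_over_limit(quarterly_survey))
```

The point of keeping it this coarse is the one made in the talk: the goal is a periodic sense of where each team stands against its limit, not detailed time tracking.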

00:22:01

Um, I actually wrote a chapter in the new O'Reilly book, um, that just, uh, came out, called Seeking SRE. That's a non-Google version of the SRE book. There's a lot to do there. So, silos. Um, obvious answer, right? Well, let's get rid of the silos, right? So we talk a lot about this in the DevOps conversations. Cross-functional teams. Um, but the idea is not that everybody does everything, right, or that you need superhumans that can do everything. It's about creating this kind of horizontal shared responsibility, uh, across the organization. Not everybody does everything. And the key to this is, if you look at, I'm gonna just kind of use it in very loose terms here, there's like a Netflix model, which is, ah, everything is these teams, right? And, uh, you know, we don't have a central operations organization. Everything is these, um, these cross-functional teams, and, uh, they do well with that.

00:22:44

And then you have, kind of, the Google model, which looks more like traditional enterprise. We have development teams and we have an operations organization, right? But they put in place these, uh, things you could read about in the SRE books, about the clear handoff requirements to get into operations. Operations, or SRE, is not a right, right? You have to prove you're worthy to get there. Uh, and they have error budgets with consequences to put, uh, responsibility back on development. So same high quality, high velocity results, very different organizational model, but it's all about building that shared, um, responsibility. So then what about the cross-cutting concerns, right? And, you know, this is where we get right back to, well, we can't put, you know, uh, everybody on every team or every specialty on every team. So we start to have more queues, right?

00:23:23

And then we gotta cross talk between these different, these different teams, more ticket queues. We're right kind of back to where we started. So how do we deal with these queues? Um, the trend that we've been seeing in the, in the industry, we call it self-service operations. That's kind of us trying to, trying to, uh, document this, this, uh, design pattern. But really the idea is how do you take the things that operations needs to do, right? The environment provisioning, the restarts, the health checks, uh, um, you know, clearing caches, scaling operations, uh, security checks, turn those things into pull-based self-services so the people who need them can hit them and use them, um, on demand. And also make it so the people who are in the organization who are building, whether developers or platform engineering teams, that they are able to build their own self-service capabilities, hand 'em off to operations and security.

00:24:07

They can do code reviews and vet them, and then turn around and give access to the right people. The idea is stay out of these teams', out of these teams' way. And anywhere you can't get rid of that handoff, don't put a ticket queue there. Put a, uh, you know, a self-service, uh, interface. And it works with any org model, right? It works with the, the kind of, you know, cross-functional team model. It works with the sort of, uh, standard dev and ops model. Um, and the idea is that people are focused on building the platform, building self-service capabilities, um, that gets rid of the ticket queues. It gets rid of the excessive toil, and people are happy. 'Cause you're putting, you're turning them back into, uh, into, into operators. And it's also a great place to build security and compliance in instead of making it something where, um, you know, it's, it's like a, it would be anti-Deming to say, you know, you're gonna have quality by external inspection.
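The self-service pattern described here (someone builds a capability, operations and security vet it, the right people get on-demand access, and compliance is built in) might be sketched as follows. This is a toy in-memory model written for illustration only. The class names, roles, and the `clear-cache` job are hypothetical, and it is not the API of Rundeck or any other tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


@dataclass
class SelfServiceJob:
    """A capability built by a dev or platform team. Hypothetical structure."""
    name: str
    action: Callable[[], str]
    allowed_roles: Set[str]
    approved: bool = False  # flipped only after operations/security review


class SelfServiceCatalog:
    def __init__(self) -> None:
        self._jobs: Dict[str, SelfServiceJob] = {}
        self.audit_log: List[dict] = []  # evidence is collected automatically

    def submit(self, job: SelfServiceJob) -> None:
        """A builder hands the job off for vetting."""
        self._jobs[job.name] = job

    def approve(self, name: str) -> None:
        """Operations and security sign off after review."""
        self._jobs[name].approved = True

    def run(self, name: str, user: str, role: str) -> str:
        """Pull-based, on-demand execution with access control and audit built in."""
        job = self._jobs[name]
        allowed = job.approved and role in job.allowed_roles
        self.audit_log.append({
            "job": name, "user": user, "role": role, "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not allowed:
            raise PermissionError(f"{user} ({role}) may not run {name}")
        return job.action()


catalog = SelfServiceCatalog()
catalog.submit(SelfServiceJob("clear-cache", lambda: "cache cleared", {"developer", "noc"}))
catalog.approve("clear-cache")
print(catalog.run("clear-cache", user="alice", role="developer"))
```

Every attempt, allowed or not, lands in `audit_log`, which is one way to read building compliance into the system rather than inspecting for it afterwards.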

00:24:51

Build it into the system. Build a place where, this is where we put security. Uh, this is where we, uh, put our, our compliance rules. All the evidence collection happens, uh, automatically. So people always go, oh, tickets, I love tickets. Right? Well, are all tickets bad? No, just use tickets for what they're good for, right? Number one, they're great for documenting true problems or exceptions, right? Trouble tickets. That was the whole, uh, the whole kind of point in the first place, right? And for compliance reasons, there's a lot of routing for necessary approvals. That's just a disaster in email, and you don't want to go into another tool. Um, so tickets are pretty good at, at that as, uh, as well, that human-to-human kind of approval chain process you need to, uh, to have. But we have to stop using tickets as general purpose work management systems, right?

00:25:31

This is where it really gets into, into problems, when we use these ticket queues, I should say it's not tickets, ticket queues, as a work management system to run, to run people's, uh, lives. So a couple of, uh, examples from this conference of companies putting self-service principles into place. Um, a few years back, uh, Jody Mulkey, the, uh, CTO then of, uh, Ticketmaster, uh, talked about, they had a problem, which was, uh, you know, when Ticketmaster has an outage, it's not like TechCrunch news, right? It's, it's New York Times news when the Yankees can't print playoff tickets. And their average MTTR, the average time to repair a public web-facing outage, was 47 minutes, right? You can imagine that's, that's a long time when people are, are, are irate <laugh>, right? And they're already, and they're already starting not so happy because you didn't give them Adele tickets in the first place, right?

00:26:16

So they talked about, you know, self-service in this way of saying that, how do you empower that NOC? How do you empower the operations teams to take action quicker? 'Cause instead they had what they called a bunch of escalators, right? It was just people picking up a phone and calling somebody else, calling somebody else. A lot of that time was the escalation. So what they did is said, well, let's take these self-service capabilities, let's start with them, and then dev and QA build them and have them test them, have them deliver them, and then when they deliver them, they're empowering the NOC and, and the level one ops teams to take action. They even got to the point where they were even putting, uh, links in their monitoring tool to their, you know, self-service operations tool to be able to say, hey, run these things first before you do anything.

00:26:52

Uh, anything else. They've now even evolved past this, where they've kind of decentralized a lot of that NOC activity anyways, and empower the new sort of DevOps teams and help the old mainframe teams. They've got a huge spread. I think the first, uh, bit of their code was checked in something like in 1976. And so after the Support at the Edge program, uh, it's like 18 months or so, they went from 47 minutes to 3.8 minutes, right? I know that MTTR doesn't mean they won't still have outages or whatever, but on average, uh, this is how things, how things went, uh, definitely a big improvement. Uh, Sean Norris, uh, spoke on, uh, this stage, the one in London, um, recently about a different kind of strategy of using self-service in, uh, you know, high compliance, right? High compliance environment, improving consistency. And he, and, you know, he works at Standard Chartered Bank, right?

00:27:37

It's a 165-year-old bank. I think he said Queen Victoria actually signed their, their charter. Everything was optimized for compliance, right? They're 80-something thousand employees. They're in 60-plus countries. Imagine how much, you know, regulation you have to deal with in 60 different countries. And they use self-services as a way to not only help standardize operations, so all those thousands of people in operations can do things consistently and see better results, but as a way to bake in compliance, because they have this, you know, where I think if you're a customer, you would agree, these strict rules over how to do things in production, and you had to review every single change by managers and everything. It was very, uh, kind of onerous. And by going with self-service and baking the compliance in, I think in 12 months, they said they had over 13,000 operations tasks in privileged environments that didn't require a review, right?

00:28:20

It was huge time savings. And I think he, he's, uh, speaking tomorrow afternoon here, I think, in this room as well. Um, and excited to hear what's next on their journey. But, uh, it's a pretty interesting story of using, you know, self-service as a catalyst to drive forward, uh, so much of the change in a high regulatory, uh, environment. So where I need your help, uh, we've been trying to, uh, kind of document this self-service operations design pattern, um, looking at what different people are doing across the industry. We wrote, uh, I guess you call it a book. Uh, it's, uh, it's, it's got a good length to it. Uh, you can get it at rundeck.com/self-service. You can just read it online or you can download the PDF. Um, what we need help for is reviews. Just look at it and check it out.

00:28:59

Um, you know, give us information on what you think. Give us some, uh, some design patterns, some ideas that you've seen, maybe volunteer your, your company to tell their story. Um, we're trying to make it sort of an industry agnostic kind of view of, of, you know, how do we empower operations so we can relieve that pressure so they can get to the things that will improve the, uh, improve the business. So, okay, so recap, uh, what I talked about. So number one, uh, you know, don't forget about operations <laugh>, right? Seems like, you know, challenge the conventional wisdom. Deployment is not the goal. Can you all say that? Deployment is not the goal, right? There's a lot to life after, uh, after deployment. Um, really understand those forces, you know, what's undermining this operations work? Internalize them, socialize them, look, teach the organization to spot them.

00:29:41

Things will get better. Uh, shift left, right? That control and, and decision making, push the ability, uh, to take action closest to where the context is. Um, you know, learn from SRE. Uh, there's a lot of interesting things. It's not the observability and the tools or the fact that operations folks are now developers. Uh, it's the idea that it's the toil limits, the error budgets. It's really kind of a fascinating way to look at operations. Focus on removing those silos and queues, and then, uh, leverage that, uh, self-service operations design pattern, wherever you can get rid of those, uh, handoff points. So I'm Damon Edwards. Um, these slides are actually already on Twitter. I saw some folks taking, uh, pictures. You can, uh, find me on Twitter there. I've already pinned the slides there, so you'll get those. Email me anytime, uh, or my DMs are open and, uh, check out the, uh, self-service operations book. And, uh, you know, all feedback welcome. We'll incorporate it in and give you credit. All right? So thank you very much. Appreciate it.