Engineering ITSM Through Site Reliability Engineering

We all know SRE as a growing engineering practice for IT operations. In fact, SRE is a modern ITSM framework that reimagines service management as an engineering practice with a singular goal of consistent reliability. This session will explore SRE in the context of ITSM, with particular insight into how SRE approaches service level, change, incident, problem, and capacity management. The session will also explore SRE as a self-regulating ITSM system that most closely aligns with Agile and DevOps as three continuous flows of managing services.

JG

Jayne Groll

CEO, DevOps Institute

Transcript

00:00:13

Hi everyone. I'm Jayne Groll, CEO of the DevOps Institute, and I'm super excited to be with you today at DevOps Enterprise Summit, talking about a topic that's near and dear to my heart, which is engineering service management using site reliability engineering practices. A little bit about me: I am currently CEO and one of the co-founders of the DevOps Institute. You may know me from my days in the ITIL and ITSM space. I was one of the co-founders of ITSM Academy and a longtime ITSM expert, with the last several years spent in the DevOps space. I'm also a former IT ops director. So I've been in the IT space a fairly long time and have had, I think, a bird's-eye view of the evolution of the tech community. I'm also the author of the Agile Service Management Guide, which you can access for free at the DevOps Institute website.

00:01:15

Speaking of DevOps Institute, I'd love to tell you a little bit more about us. Our mission is to advance the human elements of DevOps. We're a professional member association, where we try to create a safe environment for our members to connect with each other, to upskill their knowledge, to grow their careers, and then hopefully be able to support their organizations' digital transformation. We have multiple levels of membership, including a basic free membership. So go to our website, become a member. There are lots of assets and resources for you, and you'll be able to connect with other humans of DevOps as well. So what are we going to talk about today? Well, just to level set everyone's understanding, I'm going to provide a very brief introduction to site reliability engineering, and particularly SRE principles. Then we're going to look at SRE through the lens of IT service management. I'm not going to compare it to ITIL. I don't think that's fair. I really want to look at SRE as a standalone framework, focusing on site reliability engineering practices. And then we'll wrap up by looking at SRE in industry. We'll talk about the role of the site reliability engineer, and some opportunities perhaps for you to learn more. So stay tuned.

00:02:43

So it's no surprise to anybody that the last year was particularly challenging from a human and from an organizational perspective. Nobody expected this coming out of 2019 and entering the new decade. The challenges that were faced across the world were just unfathomable, and organizations had to pivot very quickly. Those that were able to adapt to a digital landscape survived; some maybe even thrived. Those that couldn't faced challenges that were never expected. And unfortunately, some organizations did not survive. But whatever this new normal is going to look like as we come out of 2020 and half of 2021, one thing's certain: digital transformation is just no longer optional. The digital landscape is real, and organizations, I think, have learned, in some cases the hard way, that they need to be able to adopt a digital approach as we move forward through the next decade and beyond.

00:03:49

But if you're going to be digital, then you also need to be reliable. And I think that's where site reliability engineering really comes in, where we understand that reliability, access, the usability of a service is really the only way to truly measure value. And so we're going to take a look at some of the practices and principles that Google described in the SRE series of books. You may not be Google. That's okay. But the practices and principles are modern, and I think they really adapt to the digital landscape, in many ways better. So what is site reliability engineering? Again, some of you may have some preconceived understanding or knowledge about it. I'm going to stay pretty high level when we look at SRE as a service management framework. It all started with the Site Reliability Engineering book, authored by several members of the Google team who really wanted to describe how they're able to keep their very complex environment and large-scale systems reliable.

00:05:00

And again, it went viral very quickly. It addresses the operational side of the house, but definitely steps back into pre-production. And while the Site Reliability Engineering book describes the practices and the principles, and has some prescriptive guidance, it was followed pretty quickly by The Site Reliability Workbook. And then just recently, Building Secure and Reliable Systems was introduced as part of site reliability engineering as well. You can read these books for free on the Google site. You can buy them through Amazon if you prefer a hard or a digital copy of your own. But each of these books was intended to describe how to manage services in large-scale environments by creating roles and practices and eliminating manual work through an engineering mindset. Look at the description from Ben Treynor Sloss of Google: SRE is what happens when you ask a software engineer to design an operations function.

00:06:10

So we're really looking at engineering operations, but you also have to engineer process and practices in order to have intelligent automation. And I think that's a key aspect of SRE as well. Google, by its own definition, considers SRE its approach to service management and calls it out in the early parts of the first book. And look at what an SRE team is responsible for: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Those are traditional, classic IT service management practices. But then rules are codified for how the SRE teams are going to interact with their environment, not only post-production, right, post-deployment, but also pre-production, by interacting with product development teams, by interacting with testers, by interacting with users. So SREs are actually key members of the development team. The operational perspective is brought into the DevOps and the agile teams, so that by the time the code or the service goes into production, there is a shared understanding of service level objectives.

00:07:29

And there is a shared understanding of engineering for reliability and what that means. To my mind, SRE is the capstone of a self-regulating system that started with agile software development and then grew further with DevOps, right, build and deploy. And now we're looking at operations as a self-regulating system as well. We needed to be able to empower teams. We needed to be able to really build on some of the principles that first came out in the Agile Manifesto about self-organizing teams. It is an alignment that I think works very well with agile and with DevOps. And I think it is the third piece of that puzzle, right, where we have these autonomous systems that are looking to deploy faster and more frequently, with higher quality.

00:08:29

So let's talk a little bit about the SRE principles. I kind of divided them up into two sets of principles, one of which really focuses on the human aspects of SRE, and the other that looks at it more from a tactical, technical perspective. At the center of site reliability engineering is the concept of service level objectives, right? That's the objective for the service. You might think of SLAs. In SRE, we really focus more on the objectives for the service, because the SLA, as you'll see in a couple of slides, really took on too many different contexts and too many different definitions. Agreeing on what an effective service level objective is gives everybody involved, from the agile teams all the way through and beyond production, a shared understanding of how this service is expected to perform and what its reliability is expected to be.

00:09:29

I mentioned self-organization and self-regulation. One of the key principles here is the ability of the team to regulate its own workload. The individual engineer has to have the empowerment to regulate their workload, as long as they're meeting the service level objectives. And then, in order to be able to regulate their workload and perhaps to make the achievement of service level objectives easier, one of my favorite parts of SRE is the ability to have proactive time. Half of an SRE's time is allocated to reactive work, but the other half of the time is allocated to proactive work. Perhaps it's automating manual tasks, perhaps it's looking at ways to improve process, but we have to be able to have the time and the resources to make tomorrow better than today. And then, like we see in other frameworks, the hyper-focus on continuous learning is essential.
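The half-reactive, half-proactive split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WorkLog` class, its field names, and the 0.5 default cap are hypothetical and are not taken from the talk or from any SRE tooling.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hours one engineer logged in a period (hypothetical structure)."""
    reactive_hours: float   # tickets, on-call pages, interrupts
    proactive_hours: float  # automation, process improvement


def reactive_share(log: WorkLog) -> float:
    """Return the fraction of logged time spent on reactive work."""
    total = log.reactive_hours + log.proactive_hours
    if total == 0:
        return 0.0  # nothing logged, so nothing to flag
    return log.reactive_hours / total


def over_reactive_cap(log: WorkLog, cap: float = 0.5) -> bool:
    """True when reactive work exceeds the assumed 50% cap."""
    return reactive_share(log) > cap
```

A team might run a check like this per engineer per quarter and treat a `True` result as a signal to push back on interrupts or to prioritize automating the noisiest tasks.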

00:10:28

Failure has to be approached as an opportunity to improve, and blameless postmortems have to be the mantra of the day. We have to be able to step away from indictment and become a learning organization that is always looking at having time to make tomorrow better than today, the ability to regulate workload, and then of course managing to the achievement of the service level objectives. The second set of principles of SRE really is more tactical. So it is about embracing risk, right, intelligent risk-taking, managing to service level objectives, and monitoring distributed systems (I'll tell you a little bit about observability in a bit), all focused on the elimination of toil. You know, toil is manual, repetitive work that could be automated if we have to do it more than once or twice, but that unfortunately consumes a lot of human time. And humans uniquely, at least today, are capable of higher-level thinking and innovation.

00:11:39

So the ability to eliminate or automate toil is essential to reliability. And so we look at that automation, but we also want to look at simplicity in terms of how we manage our services. The two go hand in hand, right? You have to have intelligent process for intelligent automation, but you also want to keep it simple, right? We don't want to have bureaucratic or difficult-to-navigate process or automation. And then it's all about the engineering of releases. If releases are engineered well, and this is where the SRE can play a key role pre-production, then the quality of the service post-production will of course be higher, and the value delivered to the customer will be greater. So I want to look at engineering reliability through service management. For the next few minutes, what I really want to do is take a look at specific practices that you probably are familiar with

00:12:43

if you have any involvement in IT service management, and how SRE approaches them. I wish I had time to do a really deep dive into each of these. I encourage you to get education, to do some self-exploration about SRE, and to read the books, because there's a lot of really deep guidance in there. But for today, we're just going to touch on each of these practices, and I'll give you some key takeaways in terms of how SRE approaches them. As I said, I'm not going to compare SRE to ITIL, but to my mind, SRE is the most modern approach to IT service management since the early days of ITIL. And as I said, Google, by its own admission, considers SRE its approach to service management. So again, if we're going to have services and we're going to focus on reliability, then those services have to be managed as well. SRE is really about systems engineering, right?

00:13:46

It's engineering a system for managing services that's lightweight, that's integrated, and that, as I said, is self-regulating. There's a shared accountability across all of the different domains in the value stream, a high focus on automation, and again, an emphasis on being proactive as opposed to completely reactive to what's happening in the service and in the application stack. So it's systems engineering, and systems engineering has people, process, and automation elements that make up each of these systems. It's not only about automation. Now, as I said, I want to take a brief look at some of the key practices that you're likely familiar with in traditional IT service management and how SRE approaches them. And at the end, I'll share with you some of the other aspects that SRE provides tangible guidance on. Again, remember, you're not Google. That's okay.

00:14:51

Each of these, I think, applies equally to medium and large organizations, regardless of whether it's a heritage organization or an organization that was born, let's say, within the last 10 or 20 years. So I mentioned service level objectives, and everyone in the organization is tasked with managing to service level objectives. That includes developers. It includes DevOps teams. Certainly it includes operations and different levels of support. Everybody understands what a service level objective is, why it's important, and what their role is in achieving those service level objectives. So the service is managed to its SLO, but it's measured by its service level indicators. Those are more discrete. It may be measuring the performance of the application stack or the infrastructure. It may be measuring the performance of testing or security or release, right? So SLIs are the measurements, and SLOs are what is managed to. And SRE, while it references service level agreements, really steps back from the focus on the SLA. Over time, the definition of an SLA has taken on so many different meanings in different contexts that, you know, your understanding of what a service level is.
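To make the distinction concrete, here is a minimal Python sketch of an SLI as a measurement and an SLO as the target it is managed to. It assumes a simple request-based availability indicator; the function names and the example numbers are hypothetical and do not come from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the agreed target the measured SLI is compared against."""
    return sli >= slo_target


# Hypothetical month: 1,000,000 requests, 500 of which failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(meets_slo(sli, slo_target=0.999))   # a "three nines" objective
print(meets_slo(sli, slo_target=0.9999))  # a stricter "four nines" objective
```

The same measurement passes one objective and misses the other, which is why the talk stresses agreeing on the SLO before anything else happens.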

00:16:20

And of course the hyper-focus on the contract aspect of an SLA has really detracted from the true meaning of the service level objective. So SRE references SLAs, but the key focus here is indeed the SLO, and nothing can happen until the SLOs are established. Now here's where I think SRE really kind of moves the needle in terms of change management. If we know what the service level objectives to be achieved are, then we can also now establish error budgets. So remember "embrace risk"? Well, it has to be embracing intelligent risk. Error budgets are then defined, where changes can pretty much happen at will, as long as the service is within its budget. You know about budgets. You may budget your own personal finances, and if you stay within your budget, you're fine. But if you overdraw, you're not so fine, right?

00:17:24

There are steps that you have to take. Well, it's the same here. An error budget is meant to be spent. It's part of the self-regulating system where the team can agree that the code or aspects of the service need to be deployed, right? So it basically raises the definition of a standard change. But if the budget is breached, then there are consequences, and the consequences don't only affect SREs. They affect the development schedule. They affect a lot of other things that happen all along the value stream. So as long as the team is staying within the error budget, then again, changes can happen at will. There are thresholds and so on, and it's highly dependent on automation, but it does increase the velocity of the releases. In many ways, it removes some of the human elements, like the change advisory board or the change approval board, right?
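The error budget idea above reduces to simple arithmetic: whatever the SLO does not promise is the budget. The Python sketch below assumes a request-based SLO and a plain "deploy while budget remains" rule; the function names, the numbers, and the gating policy are illustrative assumptions rather than anything prescribed in the talk.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the period."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if breached)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all; any failure is a breach.
        return 0.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / budget


def change_allowed(slo_target: float, total_requests: int,
                   failed_requests: int) -> bool:
    """Assumed policy: changes proceed at will while budget remains."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


# Hypothetical period: 99.9% SLO over 1,000,000 requests allows about
# 1,000 failures. With 250 failures, most of the budget is still unspent.
print(change_allowed(0.999, 1_000_000, failed_requests=250))
print(change_allowed(0.999, 1_000_000, failed_requests=1_500))
```

In practice a check like this would sit in the release pipeline, so that a breached budget pauses feature deployments automatically instead of routing every change through an approval board.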

00:18:19

It reduces the number of people that have to touch a change before it can be deployed, and it empowers the teams to be responsible for their own quality, right? Nobody wants to release something that is low quality. And it avoids issues like fatigue or contempt for change management, or even the desire to circumvent existing process. Personally, I think it's one of the coolest aspects of SRE. Now, event management is growing up as well. Monitoring certainly is a key element of understanding how the service is performing, and monitoring still very, very much exists, right? We have to look at reaction to the performance, whether it's latency, what the traffic is like, any errors that have occurred, saturation, but those are mostly reactive. Now introduce another level, known as observability, where it is taking both an internal look at the individual components of the service, whether it's application or otherwise,

00:19:26

and then also looking from the outside in, right, taking an external perspective of observation. Developers can observe their code, and DevOps teams can observe the code, and certainly operations teams can observe the code. Observability is really rising as a new and interesting practice. I encourage you to learn more about it. And then capacity management has always been one of those practices that happens, but nobody's really sure who's responsible for it. Well, in SRE, the SRE teams are put in charge of capacity planning and provisioning because, again, capacity is going to drive reliability. And so that means understanding organic growth, right, natural usage of the service that happens because of more transactions or just normal day-to-day business, and then inorganic growth, which may be event-driven. It may be seasonal. It may be certain things that are happening at a time of the year.

00:20:30

But it looks at capacity consumption first, based on certain kinds of events. In today's world of elasticity, managing capacity and having the skills and the guidance to understand how to manage capacity is important. It is critical to availability, and therefore SRE assigns that responsibility, that accountability, to the SRE teams, working of course with others as well. And then incident management, mostly for major incidents, sets up an incident command system. So it's very structured. It sets up a command post and identifies an incident commander, who's going to really structure how we're going to respond and who's going to do what, right? The incident commander removes impediments, almost like a scrum master, right, and keeps a living document. That living document is available to everyone working on the incident. It could be a ticket in an ITSM system, but it avoids some of the delays that are associated with a traditional escalation, right, by having a command post, particularly in incidents that have a major impact.

00:21:47

It also affects how on-call happens, right? So there's an incident commander, and you know, the SREs are usually the ones on call. So being able to manage on-call is essential as well. So again, it's very clear incident response, but it avoids that kind of management by running around, particularly when you're in a major incident situation. And then, you know, the goal of incident management is to restore service. So being able to identify and remove the root cause of those incidents requires some guidance as well. In SRE, we don't call it problem management; it's called effective troubleshooting. But it takes almost a medical approach, right, a scientific approach of triaging the situation, examining the symptoms, diagnosing, or at least coming up with the first diagnosis, testing different treatments, and ultimately finding a cure. And this is where blameless postmortems are so essential.

00:22:53

They have to be a key aspect of SRE culture, where we move away from indictment, right, who did this, what caused it, and look at what we learned and how we can avoid this in the future. How do we remove the root cause permanently, so future incidents won't happen? And so again, the movement to a blameless postmortem, I think, is also part of the autonomy or the empowerment that these teams will start to feel. So those are basic ITSM practices you're likely familiar with. SRE also provides some pretty tangible guidance on emergency response, load balancing, security, software engineering as an operational practice, product launches, the human skills of communication and collaboration, and managing operational load. And all of this has to be engineered, right? You may think of engineering, or engineers specifically, when it comes to enterprise architecture or automation, but this is systems engineering, and systems engineering has people, process, and technology elements.

00:24:09

And each of these is a key contributor to the quality of the service itself. So the question I often get asked, particularly by those in the ITSM community, is: is SRE more technical than traditional ITSM? It's an engineering practice. Well, as I just said, it's systems engineering, right? We have to engineer the systems of service management. If we put a durable focus on engineering, where we understand the intelligent process that's needed for intelligent automation, we embrace the principle of eliminating toil and optimizing for automation, we develop more technical skills like Python, right, one of the top skills for site reliability engineers, and the ability to write scripts, right, if you're on call in the middle of the night, and we really look at it as an engineering role, then the answer is yes. But we are IT, right? We are information technology, and all of us, and I'm the least technical person in the room,

00:25:16

all of us, right, have to be technical at some level. But again, it's mostly about developing an engineering mindset, where we look at ways to improve and we look at the service management architecture as a system of people, process, and automation. SRE is on the rise. I mean, three years in a row, DevOps Institute has run the Upskilling community survey and report. The new report was recently released, and year over year, we're seeing an increased interest at the enterprise level in site reliability engineering as an operational practice. Individuals in the organization are moving into SRE roles. They're actively learning about it. According to LinkedIn, in 2020 it was the fifth-fastest-growing role that organizations were looking to hire. It's not necessarily one-to-one, you know, one SRE to a development team, but it is becoming a very, very key role, with key teams and, most importantly, key perspectives within the enterprise. In particular, it's a real job, right?

00:26:36

Over 10,000 jobs were posted in the U.S. recently, most of which were paying $10,000 or more, right? There's some upskilling that may be necessary for you if you're looking at moving into an SRE role. Some software engineers from the development side of the house have moved into SRE roles. Systems administrators, right, systems engineers, and automation architects have also moved into these SRE roles. So it's a real job, right? It's a real job. It's got a real job description, and I encourage you, if you're looking at your own personal career growth, to learn more about SRE as a role, right? Some will call it a reliability engineer, and there are offshoots like network reliability engineers and customer reliability engineers. But at the end of the day, the core practices and principles are very, very similar. So with that, I want to thank you.

00:27:37

You know, as I said, I've watched the evolution of the IT community for a long time. I think SRE was born out of real-life practices. As I've said several times, you probably aren't Google. Maybe you are. I think there's a lot of good knowledge and education. There's peer-to-peer shadowing, right? SREs shadow developers, developers shadow SREs. I think it's really cool and innovative, and it doesn't supplant or throw away existing ITSM practices. Remember what I said: no framework is perfect, right? SRE's not perfect. ITIL's not perfect, right? Agile is not perfect. DevOps is not perfect. You know, as humans, our mission is to adopt and adapt. So I hope today you've looked at engineering IT service management through the lens of site reliability engineering practices, and maybe it sparks some ideas for you in terms of how you personally, or your organization, can start to move forward, particularly if you are in the midst of a digital transformation. So with that, I thank you. I thank the IT Revolution team for inviting me to be here again. It's really a delight, and I hope to take your questions, and I hope to meet all of you in person someday soon. Thank you very much. I'm Jayne Groll, CEO of the DevOps Institute. I appreciate it, and have a great day.