When Is SRE Right For You?

Stephen Thorne from Google Site Reliability Engineering presents a framework for deciding which of your applications will have the most fertile ground for developing an effective SRE engagement.

Stephen Thorne

Staff Site Reliability Engineer, Google

Transcript

00:00:02

My name is Stephen Thorne, and I'm, uh, a site reliability engineer at Google. And I'm gonna be talking to you about when is site reliability engineering right for you? So I always like to start with an agenda, a brief outline. Um, I'm gonna be giving a quick introduction so we share some terminology. I'm gonna be defining what SRE work is, and then introduce a framework for deciding where and when SRE is best applied in your business. I'm gonna give some examples of systems which, uh, SRE may or may not be more suitable for, and also some examples of where you might wanna do it anyway, even though they're not the most suitable systems. And then I'm just gonna repeat myself in the conclusion. So, brief introduction to SRE. I'm not gonna go super deep into anything, but I just wanna make sure that we're on the same page.

00:00:57

Site reliability engineering is all three of the following things. It is a set of principles that any team can use to do site reliability engineering. So any team can do site reliability engineering as a set of principles. You can use it to run your systems in production. It is also a way of operating in production, meaning a job function, uh, focused on operating your applications, your software, and meeting the needs of your users. You can hire site reliability engineers. And it is also a way of structuring an organization. It's a way of structuring an organization which owns, um, success. Success in terms of delivering an application that your users are happy to use. They find it reliable, fast, efficient.

00:01:45

Google needed to tackle the problem of it costing more and more to operate large, important applications at scale. And that's where site reliability engineering started. We use it at Google for the biggest, most important pieces of our infrastructure. The value SRE provides us is, as we get more and more complex products that more and more customers use, we have a team dedicated to keeping the costs of operating that software under control. Just because we get 10 times as many users this year as last year doesn't mean we need 10 times as many staff to keep the application running. As we aggressively apply toil budgets and drive innovation in production practices, we might even shrink the team over time or get those teams to take on more applications. So I'm not here to tell you how to do SRE in your organization. What I'm trying to do here is give you an idea of where and when SRE is going to be right for you. Um, if you wanna know more about the background and fundamentals of how to do SRE, um, I'd like to refer you to, uh, the books we wrote, which will give you much more detail than I can manage in the next 25 minutes.

00:02:58

Focusing on the job role of SRE, what do site reliability engineers actually do? They figure out how to get to the next slide. So site reliability engineers run the application in production.

00:03:22

Running an application means doing its releases, its capacity planning, its migrations from one cluster to another, uh, provisioning it, installing it, upgrading it. In order for an SRE team to be effective, they must time limit how much time they spend on what we call toil, the manual pieces. We say that by spending a maximum of 50% of our time doing toil, we have the headroom required to do the project work to burn that toil back down, to do that automation, to do those projects, to do those things that make things better. Error budgets. Site reliability engineering has reliability in the name. This is not about making things infinitely reliable. It's not what we're about. What I like to say is we're about making things reliable enough. The way we advocate for reliability in an SRE team is to apply an error budget. We say the system isn't meeting expectations, so let's fix it. Or the application is meeting all its reliability goals, so let's focus on improving velocity and decreasing the cost of running it. So if you've got error budget left, you have a sufficiently reliable system, it's reliable enough, you're actually there to increase development velocity and help everything move smoother.
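
The 50% toil cap described here can be sketched in a few lines of Python. This is only an illustration: the `WeekLog` record, its field names, and the sample hours are assumptions made for the example, not anything shown in the talk.

```python
from dataclasses import dataclass

TOIL_CAP = 0.5  # the "maximum of 50%" figure from the talk


@dataclass
class WeekLog:
    """Hypothetical record of one engineer-week, in hours."""
    toil_hours: float
    project_hours: float

    @property
    def toil_fraction(self) -> float:
        # Share of logged time that went to manual, operational work.
        total = self.toil_hours + self.project_hours
        return self.toil_hours / total if total else 0.0


def over_toil_budget(log: WeekLog, cap: float = TOIL_CAP) -> bool:
    """True when toil exceeds the cap, leaving too little time for project work."""
    return log.toil_fraction > cap


# Made-up sample numbers, purely for illustration.
week = WeekLog(toil_hours=26, project_hours=14)
print(f"toil fraction: {week.toil_fraction:.0%}, over budget: {over_toil_budget(week)}")
```

A team tracking something like this would treat a sustained reading above the cap as a signal to push back on operational load or invest in automation.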

00:04:36

When SREs are doing all the operational side of the work, or a sufficient amount of the operational side of the work, they know intimately the monitoring and are working on improving the system day to day. They're also the best people to do emergency response. When a system is in trouble, we page the SREs first because they're the people who can mitigate the problems fastest. And the last and most important thing we see as part of SRE work is decreasing costs. When toil has been decreased, the reliability addressed, the emergencies handled, we ask how can we decrease the cost of running this, both the human cost, how many people on the team are required, but also the resource cost. Perhaps we can run this with fewer VMs. As you can appreciate, sometimes the best thing for a successful team who has managed to do all of this is just to give them more work to do. They've been successful, let them be successful with more systems.

00:05:33

So I've blitzed through some definitions, and now I wanna introduce a framework to you. This framework is sort of designed to allow you to make a judgment about where it's best to start doing SRE in your company, in context. And I don't wanna sort of give you a checklist where, if you iterate through this checklist, you find the correct system to work with. I'm giving you ideas along these dimensions. Site reliability engineering is pretty special. You've got a team of generalist operators who can program, automate, do emergency response, cost optimize. Training them on applications is gonna be time consuming. And as a result, you wanna make sure that they're doing worthwhile engineering that provides value to your business. And before I start presenting this framework, I just wanna give a quick validation to anybody present who actually currently practices SRE: you are special and wonderful.

00:06:23

The goal here is to help decide where the best places are to start doing SRE, or to focus on where to go next. I don't know your organization, your context. None of these decisions happen in a vacuum. Just because you don't hit any or all of these points doesn't mean you're not doing SRE, uh, or that you can't be successful. So when is SRE right for you? When is site reliability engineering right for you? I think you need to hit all three of these points to be maximally successful. The closer you get here, the better. Mission critical: is it worth it to the business to invest in site reliability engineering? Operable: can the SREs do anything on the system to maintain it? What about when things go wrong? Mutable: can SRE make it better over time? Mission critical, operable, mutable, these are three dimensions. They're not binary yes, no, but they're relative. You can say this system is more operable than that system.
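
One way to picture "relative, not binary" is to record a rough judgment for each system on the three dimensions and compare systems against each other. The sketch below is an assumption-heavy illustration, not a checklist from the talk: the 1 to 5 scale, the `Candidate` class, the "weakest dimension first" ordering, and the sample systems are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A system under consideration, scored 1 (low) to 5 (high) by your own judgment."""
    name: str
    mission_critical: int
    operable: int
    mutable: int

    def weakest_dimension(self) -> int:
        # A system needs all three, so the lowest score limits the engagement.
        return min(self.mission_critical, self.operable, self.mutable)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates by weakest dimension, breaking ties on criticality."""
    return sorted(
        candidates,
        key=lambda c: (c.weakest_dimension(), c.mission_critical),
        reverse=True,
    )


# Hypothetical systems and scores, for illustration only.
systems = [
    Candidate("system_y", mission_critical=5, operable=2, mutable=1),
    Candidate("system_x", mission_critical=5, operable=4, mutable=4),
    Candidate("system_z", mission_critical=2, operable=4, mutable=4),
]
for c in rank(systems):
    print(c.name, c.weakest_dimension())
```

The ordering is only a conversation starter; the context of the organization still decides where to begin.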

00:07:21

The first and most important attribute of an application, when considering applying SRE principles, is how important that application is to the business. The amount of funding and effort an application deserves to maintain its reliability should be proportionate to the value of that application to your business. The more mission critical the application, the more investment in its reliability is warranted. So what makes something mission critical? Um, your customers will notice. It impacts your revenue. Um, all other work stops. You can even call your CI/CD, uh, pipeline mission critical if you decide that there's enough business value there in keeping that up and running and working. So, what questions can you ask to assess the criticality? If you had to choose one to save first, which one would it be? Um, perhaps it's entirely an economic decision: system X is independent of, and makes more money than, system Y, so do that first. To judge how important a system is, think about questions like, should I wake up at 3:00 AM to fix this system? What about 4:00 AM? What about 6:00 AM? Like, where is your value judgment here? How important is this system to you, your customers, your business?

00:08:35

I think mission critical systems are the best place to start your SRE journey. Next one, operability. The operability of an application refers to how a team might interact with the system, both in the normal day-to-day maintenance and how they might be able to perform fixes when things go wrong.

00:08:57

So what do we mean by operability and why is it important? If you task a team with running an application in production and make them responsible for it, they need to have confidence they can do their work. If day to day you have the ability to, uh, shepherd your releases, scale your resources, then when you have a success disaster and you have to scale up rapidly, or a bad release and you have to roll back, you know exactly how to get that done. Operability means that actions can be taken by someone who's not the person who wrote it in the first place. You can run it, maintain it and debug it. There are things you can see in an operable system. You see monitoring that you actually like. You have confidence that you can fix outages, a high level of confidence that your procedures work and a team member can execute them without help. Things that make operability harder could be no visibility into what goes wrong. Fixing errors requires an expert developer to write a fix and release it in a roll forward. Uh, rollbacks are typically blocked by data migrations and things like this. Um, you might have a problem with, uh, vertically scaling your compute, which means that overload is much harder to mitigate because you can't get a bigger machine than that. Um, single points of failure exist and the only recovery is to restart, rebuild, or redeploy that single point of failure.

00:10:18

Operable systems mean your SREs can actually do something with them. When you have an operable system, your SRE will be able to provide value day to day and actually do that toil and fix those outages and do something with that system.

00:10:34

And I think the last thing you need in order for your site reliability engineering team to be able to provide value once they're working on a system is a way to make that system better over time. And you need to enable that. The mutability of an application refers to how possible it is to change the application in order to make it more reliable or decrease the cost of operation. If a site reliability engineer is spending a maximum of 50% of their time doing those operational aspects, that toil, then the remainder of their time should be spent making things better. And that's why we need mutability for something to be a system which you would apply site reliability engineers to. Mutability is a dimension that you can actually achieve in many ways. Can you redesign it? Can you change it? Do you have plans to improve it? Are you going to task your SREs to do those improvements? You've got measurements about your system. What are you gonna do about those measurements?

00:11:31

To assess how mutable a system is, um, you need to think, what am I going to task these engineers to do? Site reliability engineers need projects and goals to accomplish beyond just keeping it running. If they decide that a re-architecture is the way to meet our reliability or efficiency goals, is that ever actually gonna work in that organization? Are they gonna be respected? Do they have a seat at the table? Are you gonna enable that? Um, will they be able to fund their work if they need to spend more? Um, and it's a good question: what happens when you get a hundred times more users? Will this team be able to handle it, or will they be powerless and just have to throw up their arms and ask for help?

00:12:14

Mutable systems mean your SREs working on those systems can provide long-term value. If they're not providing long-term value, then they're hamstrung in their site reliability engineering. They can't provide value long-term. They're just operating the system. So I'm gonna run through some example cases now. Please bear in mind that these systems here are entirely examples. Um, and the value judgments are my own, and your systems might be treated entirely differently, architected entirely differently, but have the same names. So first is something I hope is a good example. Ad serving is the sort of thing that either works and ads are displayed, or it doesn't, and then users get slow loading pages. So it's pretty mission critical, probably easy to justify this in your business. It's operable. There's lots of servers 'cause you've gotta serve lots of content to a lot of users. You have to provision them, scale 'em up, tune them, monitor them.

00:13:07

There's all of this work to do in order to keep this fleet running. And it's mutable. The team running production can think of ways to run more efficiently, to do safer releases, decrease batch size, scale faster and more efficiently, and reduce their toil in operating the system. So I think ad serving is sort of a good first example. But then my second example is something that's much harder to SRE. Now bear in mind everything here is relative. You can SRE anything. But the question is, is it worth it to your business? Networking devices, and I mean specifically vendor networking equipment. It's probably entirely mission critical. Um, if it's down, nothing else works, but you don't really have much you can do about it. Especially during a business day. It's probably just gonna be safer to wait for a quiet period and do safe scheduled maintenance on the weekend.

00:13:52

You know, this is a model that we're very familiar with. Mutable? You can kind of think of project work to do, but it's largely gonna be how do we add redundancy or test our configs better? It's not gonna be the sort of thing that the business will find a lot of value in. So if you ever SRE your networking hardware at all, it's probably gonna be something you do further down the road, once you've done everything else. So, hosted software. This is software you run, but you didn't write yourself. Your business might be able to handle your CRM being down for a few hours, or it might be a critical dependency of your business. So the mission criticality depends very much on how you use it and what it's for. Operability: apart from making sure it doesn't run out of RAM, um, there's really not much to do. You just keep it running. And mutability isn't the sort of thing you're gonna be tasking your site reliability engineers to do much about, 'cause it's somebody else's software that you're just hosting. So perhaps there's something that's better. So I threw in this example. Um, I think it's likely your HR website isn't going to be mission critical enough to justify an SRE team taking care of it. It might even be operable and mutable, but is it gonna justify the staffing? Is there something better to consider? A web shop.

00:15:05

This is probably just as critical as the network hardware example. Your customers are going to go elsewhere if they can't access your website, your web shop. So, operability: there's all the routine maintenance tasks. Um, plus you have to think about capacity planning for big events. Uh, Black Friday or Boxing Day sales can be a big deal in the, uh, in the web shop space. And mutability: there will likely be a large number of projects you could do based on what can go wrong with the web shop. You might need to work on making it more reliable, protect against internet threats, or if things are going pretty good, partner with the development team to increase their velocity. A data processing pipeline. So it might be possible, uh, for it to be down for a while before it has business impact. You know, it's a data processing pipeline. Data goes in one end, six hours, 12 hours later it comes out the other, but it doesn't actually affect anything for a couple of days. But it's also gonna be pretty operable. Scaling it, maintaining it, debugging bad data and troubleshooting. There's lots to do. And typically SRE teams I've known that have engaged on data pipelines end up forming an extremely strong and productive relationship with their dev team. So this is actually somewhere where I think SRE can provide a lot of value, uh, to your business, but perhaps there's something more critical that would provide more immediate value to your business that you should apply SRE to first.

00:16:31

My last example here, I think, is a Kubernetes cluster. So this is the cluster or clusters that host our mission critical workloads. And it's gonna be, you know, again, pretty much up there on the mission criticality aspect. It's gonna take maintenance to keep it going. And that work may need to be given to experts to get it done. So this could be your SRE team. Mutable: there's probably project work you can do to help this go over time, or you might just let it tick along. And the challenge here is, if you have a Kubernetes cluster that all your applications run upon, the temptation is you might get SRE to take on the Kubernetes cluster, but really, actually, you care about the business outcome of the applications running on top. So you might actually task your SRE team with running the most critical application and the platform, or get two teams, one that depends on the other. And that's a sort of way that you compose this situation so that you end up providing value all the way through. So I've given you a couple of examples and a sort of relative feel for, like, this is more critical than that, this is more operable than that. But what if you wanna do it anyway? Um, and on a system that isn't so suitable for SRE? Well, there's ways you can do that. The first way is you can always lower your expectations. If a system is not super critical, you can set lower targets.

00:17:52

Um, in site reliability engineering, the first thing I always think about is service level objectives. Um, the service level objectives of my system, I'm measuring its reliability. Is that reliability good enough? Is my SLO a good target? I know that it can take a lot of work to keep a system running at four nines of availability or, in other terms, uh, five minutes a month of downtime. But with the same amount of engineering time, I can probably support a dozen systems at two nines of availability, which gives me a budget of eight hours of hard downtime a month. So that means that I can take the same amount of staffing and deal with many, many more systems, and it allows us to run, uh, more efficiently across a lot larger fleet.
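
The downtime figures quoted here follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month (the function name and the window length are choices made for this example):

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of full downtime allowed in the window for an availability SLO."""
    return (1.0 - slo) * days * 24 * 60


for label, slo in [("four nines", 0.9999), ("two nines", 0.99)]:
    minutes = downtime_budget_minutes(slo)
    print(f"{label} ({slo:.2%}): {minutes:.1f} minutes, or {minutes / 60:.1f} hours per 30 days")
```

Over 30 days this works out to roughly 4.3 minutes at four nines and roughly 7.2 hours at two nines, in line with the rounded "five minutes" and "eight hours" in the talk.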

00:18:38

So the process of vetting the application before SRE takes responsibility for it is important. This is a process that we go through at Google. We don't take for granted that any application running in production is always run by SRE. Typically, actually, we have our developers build it and run it, and then eventually they decide which applications they're going to fund SRE teams for. So when they want to staff SRE on that system, the SRE team ramps up on that system using what we call a production readiness review. It's this review that often significantly increases the operability of a system. It's where we document all the knobs and buttons we can twiddle, run through the checklist to make sure all the standard operational actions can be done and things have conformity and uniformity. Um, and completing the production readiness review, or PRR, will confirm the operability of our system. Your PRR is a way of making sure that, before the SRE team agrees to shoulder the ultimate responsibility for the success of the application, they're going to actually be able to provide a benefit once they do.

00:19:51

You can always add more failure domains. It might be expensive to duplicate critical infrastructure, but if the business sees value in it, it might be worth doing and worth doing well. I often get asked, how can I run a reliable system when all my critical dependencies are unreliable? The answer's very context dependent, but it's possible, by adding failure domains, duplicating entire systems or using techniques like availability caches, that you can make it so that even though the backend goes down, you've cached the data you need, or there's a second independent system that you can query, allowing you to keep your production system working. Your systems might all feel immutable from the operations side today, but it doesn't have to be that way. You can enable your engineers. Enablement might come in various forms. It could be allowing them access and training to be able to do development alongside your software development teams. It might be bringing SRE into the design process. Introduce your SRE to product management and talk about how best your customers can be served by talking all the way back at the product management level. This enablement is what's going to ultimately result in your SRE group really caring about the customer experience, and tooling them up to defend that customer experience in the long term. And that's what SRE is all about. It's about aligning the operational aspects of running a system with what your customers actually want, which is a system that actually works and delivers value to them. Ultimately, that's what your business wants, because your business wants to provide value.
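
An availability cache of the kind mentioned here can be sketched as a wrapper that remembers the last good answer from a backend and serves it when the backend fails. This is a minimal single-process illustration under stated assumptions: the `AvailabilityCache` class, the `max_stale_seconds` limit, and the `flaky_backend` function are all hypothetical, and a real deployment would need to consider eviction, concurrency, and how stale is acceptable for the data in question.

```python
import time
from typing import Callable, Dict, Tuple


class AvailabilityCache:
    """Serve the last known good value when the backend call fails."""

    def __init__(self, fetch: Callable[[str], str], max_stale_seconds: float = 3600.0):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            # Backend is down: fall back to a cached copy if it is fresh enough.
            cached = self._store.get(key)
            if cached is not None and time.monotonic() - cached[0] <= self._max_stale:
                return cached[1]
            raise  # nothing usable cached, so surface the original failure
        self._store[key] = (time.monotonic(), value)
        return value


# Hypothetical backend that can be switched off to simulate an outage.
backend_up = {"ok": True}


def flaky_backend(key: str) -> str:
    if not backend_up["ok"]:
        raise ConnectionError("backend unavailable")
    return f"profile-for-{key}"


cache = AvailabilityCache(flaky_backend)
print(cache.get("user-1"))   # fetched from the backend and remembered
backend_up["ok"] = False
print(cache.get("user-1"))   # served from the cache while the backend is down
```

The trade-off is serving possibly stale data instead of an error, which is only appropriate where the business would rather show an older answer than none.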

00:21:27

So in conclusion: assess your applications, SRE the one that matters most and is operable and mutable, and iterate. Thank you. Now I believe I have about five minutes for questions. If there are questions from the audience, I'm happy to repeat them. Have a question? You do have a question? Uh,

00:22:02

Yeah, <laugh>. Um, so, um, if we would like to start working with SRE, right? And try to convince our superiors for example, that this is the way to go, what should I start with?

00:22:17

So if the, the question is if I wanna start SRE and I want to convince my company leadership that this is the way to go pitch

00:22:23

The idea, but what should I start with? Should open up?

00:22:29

I think you should try and figure out where it will provide value in your business because you should be able to, to look at, look at some applications, some system and say either we can make this more reliable or decrease costs of running and over time, this is literally how, how much better it will be. And try and figure out how you can put it in business terms. Because if you can't figure out how it's gonna provide value,

00:22:54

Then

00:22:55

Why do I need then why do you need to do it? Yes, exactly. Thank you

00:22:58

<laugh>.

00:23:02

Um,

00:23:02

Would you say, um, start with one system or just thinking about scale, take one, you know, POC time,

00:23:09

The, so the question is, would I start with just one system? Um, I would start with potentially one system or one stack. Like you might say, I will take responsibility for our most critical application plus the Kubernetes cluster plus whatever dependency, and then subdivide and I would iterate. Um, you, you want to have people in place that can actually defend the customer experience. And if they're, if they're not empowered and enabled to actually defend it, both short term and long term, then they're, they're applying their energy to the wrong place.

00:23:44

Sorry, follow up question. In terms of SRE development, how would you, what in your experience, um, you know, taking that, that kind of, uh, responsibility, how does, in your experience, how does that work out from operations?

00:23:57

Um, so, so the question is in terms of taking responsibility for operations from a developer, how does that work in sort of a social aspect?

00:24:04

Yeah, in, in terms of, uh, people buying into the

00:24:07

Idea, the people buying in. Um, my experience is that my developers are all extremely, extremely keen to never have to worry about operations ever again. And I, and in fact, I'll, I'll answer it a different way. Sometimes our developers get too reliant on us, and this is something that we, we sometimes talk about is that we should leave our developers doing a little bit of running it themselves. So we make them partially responsible. We might put them on one on-call rotation a month. Um, we might make them do the releases, um, or half the releases or, and just keep their finger in because sometimes they lose visibility of the fact that there's a user at the end of the journey. And we should all keep our eye on what actually matters, which is the users. And letting out and shielding our developers from that is actually where I see this going wrong.

00:24:54

Thank you.

00:24:56

There's question up there. Yeah.

00:24:57

So at what level of scale, um, is worth the investment of hiring SRE?

00:25:04

What level of scale? Um, now I have to respond that my experience is at Google, um, where we have, uh, for many years had an incredible level of scale, but I've seen SRE done successfully, uh, in sort of 40 to 50 person businesses. But, um, as you scale down, what you end up doing is applying practices but not staffing. Like, you would think about applying some of the things like, um, error budgets and alignment of incentives around reliability, but not necessarily staffing up somebody to do that full time. Um, maybe you need someone to be SRE for two months and go on to another project. Um, really, uh, it comes down to where is the value to you as a business? Can you justify it?

00:25:58

Yeah. What would you say if someone asked you what's the difference between an SRE and a DevOps specialist engineer?

00:26:05

What is the difference between an SRE and a DevOps specialist engineer? Um, I,

00:26:13

The real answer for me is that SRE has always been a job role. Um, at Google it was what happened when we took a group of software engineers and said, you know, your responsibility is to nothing but the production platform. Uh, the DevOps specialist engineers, they're applying themselves in the same space, but came from a different heritage. Um, we had a parallel evolution between SRE and DevOps. Um, I mean, SRE started in 2002 at Google, but we didn't really talk about it externally. And so, um, the cross pollination has only been recent. So we ended up with essentially different names for the same role and the same goals. So, question over here. Um, oh, sorry. Up the back there first. He's got a microphone.

00:27:03

What would you say is the difference between an SRE and a CRE?

00:27:07

Oh, what's the difference between an SRE and a CRE? So I am a customer reliability engineer at Google, meaning I'm on a team of SREs called CRE, which causes all sorts of non complimentary issues. Um, so I am genuinely an SRE, but the focus of my team is to look outwards at our customers who are using us as a platform, to enable them to have success on top of our platform. So, much the same way that you would have a team inside your company that runs Kubernetes as an SRE team, and then a team above that that runs an application on that Kubernetes cluster, I'm in an SRE team and I actually interface with other SRE teams and our customers. There was a question down here?

00:27:46

Yeah, um, <inaudible>.

00:27:52

So if I got it right, then you have SRE teams who run the applications and care about production, but are authorized to make changes to the source code in case of any mm-Hmm. <affirmative> problems or errors. And at the same time, you still have development teams that may develop new features and bring them into the same source Mm-Hmm. <affirmative> repository. So how do you handle those conflicts between, um, SRE teams changing the code while development teams by themselves change the code for the next features?

00:28:27

Um, it's very simple. If we're inside of error budget, then everything is fine and we continue going. And if we run out of error budget, we go back to our development team and say, something's gone wrong. We've gotta work together in order to make this more reliable. So it's a little bit reactive, in that we say the system is reliable enough, so let's remove all of these brakes, remove the friction, increase the velocity. And if the system's not reliable enough, we're not meeting that criteria, then the SRE team is empowered to do something to address that. And sometimes that's saying to our development partners, hey, no features for a while. We've gotta work on, um, reliability and efficiency goals. Uh, sometimes it's other techniques, but there's gotta be some kind of control structure in place to, uh, align the incentive around reliability. Um, I think that's all we have time for. I'm gonna be in the speaker's lounge after this if you want to have any follow up questions without the audience. Thank you very much.
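
The control structure described in this answer can be sketched as a small calculation over a measurement window. This is an illustrative assumption rather than Google's actual policy: the request-based budget, the function names, and the two policy messages are invented for the example.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left in the window (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    """Map the remaining budget to the two outcomes described in the answer."""
    if remaining > 0:
        return "within budget: keep shipping features"
    return "budget spent: prioritize reliability work"


# Made-up window: a 99.9% SLO over one million requests with 400 failures.
remaining = error_budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(f"{remaining:.0%} of budget left -> {release_policy(remaining)}")
```

With these sample numbers about 60% of the budget remains, so the policy favors velocity; once failures exceed the allowance, the same check flips toward reliability work.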