When Is SRE Right For You? (London 2019)

Stephen Thorne from Google Site Reliability Engineering presents a framework for deciding which of your applications will have the most fertile ground for developing an effective SRE engagement.

breakoutlondon2019

(No slides available)

Stephen Thorne

Staff Site Reliability Engineer, Google

TRANSCRIPT

00:00:02

My name is Stephen Thorne, and I'm a site reliability engineer at Google. I'm going to be talking to you about when site reliability engineering is right for you.

00:00:14

So I always like to start with an agenda or a brief outline. I'm going to give a quick introduction, so we share some terminology. I'm going to define what SRE work is, and then introduce a framework for deciding where and when SRE is best applied in your business. I'm going to give you some examples of systems for which SRE may or may not be most suitable, and also some examples of where you might want to do it anyway, even though they're not the most suitable systems. And then I'm just going to repeat myself in the conclusion. So, a brief introduction to SRE. I'm not going to go super deep into anything, but I just want to make sure that we're on the same page.

00:00:57

Site reliability engineering is all three of the following things. It is a set of principles that any team can use to do site reliability engineering. So any team can do site reliability engineering: it's a set of principles you can use to run your systems in production. It is also a way of operating in production, meaning a job function focused on operating your applications and your software and on meeting the needs of your users. You can hire site reliability engineers. And it is also a way of structuring an organization which owns success, in terms of delivering an application that your users are happy to use, that they find reliable, fast, and efficient.

00:01:44

Um,

00:01:46

Google needed to tackle the problem of it costing more and more to operate large, important applications at scale, and that's where site reliability engineering started. We use it at Google for the biggest, most important pieces of our infrastructure. The value SRE provides us is that, as we get more and more complex products that more and more customers use, we have a team dedicated to keeping the costs of operating that software under control. Just because we get 10 times as many users this year as last year doesn't mean we need 10 times as many staff to keep the application running. As we aggressively apply toil budgets and drive innovation in production practices, we might even shrink the team over time, or get those teams to take on more applications. So I'm not here to tell you how to do SRE in your organization. What I'm trying to do here is give you an idea of where and when SRE is going to be right for you. If you want to know more about the background and fundamentals of how to do SRE, I'd like to refer you to the books that we wrote, which can be much more detailed than I can manage in the next twenty-five minutes.

00:02:58

Focusing on the job role of SRE: what do site reliability engineers actually do? They figure out how to get to the next slide. So, site reliability engineers run the application in production.

00:03:22

Running an application means doing its releases, its capacity planning, its migrations from one cluster to another, provisioning it, installing it, operating it. In order for an SRE team to be effective, they must limit how much time they spend on what we call toil: the manual pieces. We say that by spending a maximum of 50% of our time doing toil, we have the headroom required to do the project work, to burn that toil back down, to do that automation, to do those projects.

The main thing is error budgets. Site reliability engineering has reliability in the name, but this is not about making things infinitely reliable. That's not what we're about. What I like to say is we're about making things reliable enough. The way we advocate for reliability in an SRE team is to apply an error budget. We say either: the system isn't meeting expectations, so let's fix it. Or: the application is meeting all its reliability goals, so let's focus on improving velocity and decreasing the cost of running it. So if you've got error budget left, you have a sufficiently reliable system. It's reliable enough. You're actually there to increase development velocity and help everything move smoothly.
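The error-budget reasoning described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the talk does not say how the budget is measured, so the sketch assumes a request-based SLO, and the names `ErrorBudget`, `allowed_failures`, `remaining`, and `focus` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one SLO window, counted in requests (an assumption)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the window

    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self) -> float:
        # Positive: budget left. Zero or negative: budget spent.
        return self.allowed_failures() - self.failed_requests

    def focus(self) -> str:
        # Budget left means "reliable enough", so work on velocity and cost.
        if self.remaining() > 0:
            return "velocity and cost"
        return "reliability"


# Hypothetical numbers, purely for illustration.
service = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(service.focus())
```

The point of the design is that the decision is driven by a measurement against an agreed target rather than by a goal of zero failures.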

00:04:37

When SREs are doing all the operational side of the work, or a sufficient amount of it, they know the monitoring intimately and are working on improving the system day to day. They're also the best people to do emergency response. When a system is in trouble, we page the SREs first, because they're the people who can mitigate the problems fastest. And the last and most important thing we see as part of SRE work is decreasing costs. When toil has been decreased, the reliability addressed, and the emergencies handled, we ask: how can we decrease the cost of running this? Both the human costs, meaning how many people are required on the team, and the resource costs. Perhaps we can run this with fewer VMs. As you can appreciate, sometimes the best thing for a successful team who has managed to do all of this is just to give them more work to do. They've been successful; let them be successful with more systems.

00:05:33

So that was a blitz through some definitions, and now I want to introduce a framework to you. This framework is designed to allow you to make a judgment about where it's best to start doing SRE in your company, in context. I don't want to give you a checklist where, if you run right through it, you find the correct systems to work with. I'm giving you ideas along these dimensions. Site reliability engineering is pretty special. You've got a team of generalist operators who can program, automate, do emergency response, and cost-optimize. Training them on applications is going to be time-consuming, and as a result you want to make sure that they're doing worthwhile engineering that provides value to your business. And before presenting this framework, I just want to give a quick validation to anybody present who currently practices SRE: you are special and wonderful.

00:06:23

The goal here is to help decide where the best places to start doing SRE are, or to focus on where to go next. I don't know your organization or your context. None of these decisions happen in a vacuum. Just because you don't hit any or all of these points doesn't mean you shouldn't do SRE, or that you can't be successful. So when is site reliability engineering right for you? I think you need to hit all three of these points to be maximally successful. The closer you get here, the better. Mission critical: is it worth it to the business to invest in site reliability engineering? Operable: can the SREs do anything on the system to maintain it? What about when things go wrong? Mutable: can SRE make it better over time? Mission critical, operable, mutable. These are the three dimensions. They're not binary yes or no; they're relative. You can say this system is more operable than that system.
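One way to write down relative judgments along the three dimensions is sketched below. It is not a checklist and not part of the talk: the 0 to 1 scale, the `Candidate` class, the `rank` helper, and the choice to rank by the weakest dimension (because all three are said to be needed) are assumptions made for illustration, and the scores are invented.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    mission_critical: float  # 0.0 to 1.0, relative to your other systems
    operable: float
    mutable: float

    def weakest_dimension(self) -> float:
        # Assumption: a system is only as good a fit as its weakest dimension.
        return min(self.mission_critical, self.operable, self.mutable)


def rank(candidates: Iterable[Candidate]) -> List[Candidate]:
    # Best fit first: weakest dimension, then criticality as a tie-breaker.
    return sorted(
        candidates,
        key=lambda c: (c.weakest_dimension(), c.mission_critical),
        reverse=True,
    )


# Invented scores for three hypothetical systems.
systems = [
    Candidate("system_a", mission_critical=0.9, operable=0.8, mutable=0.7),
    Candidate("system_b", mission_critical=1.0, operable=0.2, mutable=0.1),
    Candidate("system_c", mission_critical=0.3, operable=0.8, mutable=0.8),
]
for candidate in rank(systems):
    print(candidate.name, candidate.weakest_dimension())
```

The scores only make sense relative to each other inside one organization, which matches the point that the dimensions are relative rather than binary.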

00:07:21

The first and most important attribute of the application, when considering applying SRE principles, is how important that application is to the business. The amount of funding and effort an application deserves to maintain its reliability should be proportionate to the value of that application to your business. The more mission critical the application, the more investment in its reliability is warranted. So what makes something mission critical? Customers will notice. It impacts your revenue. All other work stops. You can even call your CI/CD pipeline mission critical, if you decide that there's enough business value in keeping that up and running. So what questions can you ask to assess the criticality? If you had to choose one to SRE first, which one would it be? Perhaps it's entirely an economic decision: system X is depended upon and makes more money than system Y, so do that first. To judge how important a system is, think about questions like: should I wake up at 3:00 AM to fix the system? What about 4:00 AM? What about 6:00 AM? Where is your value judgment here? How important is this system to you, your customers, your business?

00:08:35

I think mission critical systems are the best place to start your SRE journey. Next one: operability. The operability of a platform or an application refers to how a team might interact with the system, both in normal day-to-day maintenance and in how they might be able to perform fixes when things go wrong.

00:08:57

So what do we mean by operability, and why is it important? If you task a team with running an application in production and make them responsible for it, then they need to have confidence they can do their work. Day to day, you have the ability to shepherd your releases and scale your resources. When you have a success disaster and you have to scale up rapidly, or a bad release and you have to roll back, you know exactly how to get that done. Operability means the actions can be taken by someone who's not the person who wrote it in the first place. You can run it, maintain it, and debug it. There are things you can see in an operable system: you see monitoring that you actually like, you have confidence that you can fix outages, and you have a high level of confidence that your procedures work and a team member can execute them without help.

Things that make operability harder could be no visibility into what goes wrong, or fixing errors requiring an expert developer to write a fix and release it in a roll-forward. A rollback is typically blocked by data migrations and things like this. You might have a problem with only vertically scaling your compute, which means that overload is much harder to mitigate, because you can't get a bigger machine than that. Or single points of failure exist, and the only recovery is to restart, rebuild, or redeploy that single point of failure.

00:10:18

Operable systems mean your SREs can actually do something. When you have an operable system, your SREs will be able to provide value day to day, actually do that toil, fix those outages, and do something with that system.

00:10:34

And I think the last thing you need in order for your site reliability engineering team to be able to provide value, once they're working on a system, is a way to make that system better over time. And you need to enable that. The mutability of an application refers to how possible it is to change the application in order to make it more reliable or decrease the cost of operation. If a site reliability engineer is spending a maximum of 50% of their time doing those operational aspects, that toil, then the remainder of that time should be spent making things better. And that's why we need mutability for something to be a system to which you would apply site reliability engineers. Mutability is a dimension that you can actually achieve in many ways. Can you redesign it? Can you change it? Do you have plans to improve it? Are you going to task your SREs to do those improvements? You've got measurements about your system. What are you going to do about those measurements?

00:11:31

To assess how mutable a system is, you need to think: what am I going to task these engineers with? Site reliability engineers need projects and goals to accomplish beyond just "keep it running". If they decide that a rearchitecture is the way to meet your reliability or efficiency goals, is that ever actually going to work in that organization? Are they going to be respected? Do they have a seat at the table? Are you going to enable that? Will they be able to fund their work if they need to spend more? And a good question: what happens when you get a hundred times more users? Will this team be able to handle it, or will they be powerless and just have to throw up their arms and ask for help? Mutable systems mean your SREs working on those systems can provide long-term value. If they're not providing long-term value, then they're hamstrung as site reliability engineers. They can't provide value long-term; they're just operating the system.

00:12:32

So I'm going to run through some example cases now. Please bear in mind that these systems are entirely examples, and the value judgments are my own. Your systems might be treated entirely differently and architected entirely differently, but have the same names. So first is something that I hope is a good example: ad serving. It's the sort of thing that either works and the ad is displayed, or it doesn't, and then users get slow-loading pages. So it's pretty mission critical, and probably easy to justify in your business. It's operable: there are lots of servers, because you've got to serve lots of content to a lot of users. You have to provision them, scale them up, tune them, monitor them. There's all of this work to do in order to keep this fleet running. And it's mutable: the team running production can think of ways to run more efficiently, do safer releases, decrease batch size, scale faster and more efficiently, and reduce the toil in operating the system.

00:13:21

So I think ad serving is a good first example, but my second example is something that's much harder to SRE. Now, bear in mind, everything here is relative. You can SRE anything, but the question is, is it worth it to your business? Networking devices, and I mean specifically vendor networking equipment. It's probably entirely mission critical: if it's down, nothing else works. But you don't really have much you can do about it, especially during the day. It's probably just going to be safer to wait for a quiet period and do safe scheduled maintenance on the weekends. This is a model that we're very familiar with. Is it mutable? You can kind of think of project work to do, but it's largely going to be: how do we add redundancy, or test out config setups? It's not going to be the sort of thing that the business will find a lot of value in. And so if you ever SRE your networking hardware at all, it's probably going to be something you do further down the road, once you've done everything else.

00:14:13

So, in-house hosted software. This is software you run, but didn't write yourself. Your business might be able to handle your CRM being down for a few hours, or it might be a critical dependency of your business. So the mission criticality depends very much on how you use it and what it's for. Operability: apart from making sure it doesn't run out of RAM, there's really not much to do. You just keep it running. As for mutability, it isn't the sort of thing you need to be tasking site reliability engineers to do much about, because it's somebody else's software that you're just hosting. So perhaps there is something that's better to SRE. In this example, I think your HR website isn't going to be mission critical enough to justify an SRE team taking care of it. It might even be operable and mutable, but is it going to justify the staffing? Is there something better to consider? A web shop.

00:15:05

This is probably just as critical as the network hardware example: your customers are going to go elsewhere if they can't access your web shop. For operability, there are all the routine maintenance tasks, plus you have to think about capacity planning for big events. Black Friday or Boxing Day sales can be a big deal in the web shop space. As for mutability, there's likely to be a large number of projects you could do, based on what can go wrong with a web shop. You might need to work on making it more reliable, or protect against internet threats, or, if things are going pretty well, partner with the development team to increase that velocity.

Data processing pipeline. It might be possible for it to be down for a while before it has business impact. You know, it's a data processing pipeline: data goes in one end, and six hours or 12 hours later it comes out the other, but it doesn't actually affect anything for a couple of days. But it's also going to be pretty operable: scaling it, maintaining it, debugging bad data, and troubleshooting. There's lots to do. And typically, SRE teams I've known that have engaged on data pipelines end up forming an extremely strong and productive relationship with that dev team. So this is actually somewhere where I think SRE can provide a lot of value to your business, but perhaps there's something more critical, which would provide more immediate value to your business, that you should apply SRE to first.

00:16:31

My last example here is a Kubernetes cluster. This is the cluster, or clusters, that host your mission critical workloads, so it's going to be, again, pretty much up there on the mission criticality aspect. It's going to take maintenance to keep it going, and that may need to be given to experts to get it done. So this could be your SRE team. Mutable: there's probably project work you can do to help this grow over time, or you might just let it tick along. And the challenge here is, if you have a Kubernetes cluster that all your applications run on, the temptation is to get SRE to take on the Kubernetes cluster, when really you care about the business outcome of the applications running on top. So you might actually task your SRE team with running the most critical application and the platform, or get two teams, one that depends on the other. That's a way you can compose the situation so that you end up providing value all the way through.

So I've given you a couple of examples and a relative feel for "this is more critical than that, this is more operable than that". But what if you want to do it anyway, on a system that isn't so suitable for SRE? Well, there are ways you can do that. The first way is you can always lower your expectations. If a system is not super critical, you can set lower targets.

00:17:52

In site reliability engineering, the first thing I always think about is the service level objectives of my system. I'm measuring its reliability. Is that reliability good enough? Is my SLO a good target? I know that it can take a lot of work to keep a system running at four nines of availability, or in other terms, five minutes a month of downtime. But with the same amount of engineering time, I can probably support a dozen systems at two nines of availability, which gives me a budget of eight hours of downtime a month. So that means that I can take the same amount of staffing and deal with many, many more systems, and it allows us to run more efficiently across a much larger fleet.
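The downtime figures quoted here are round numbers, and the arithmetic behind them can be checked with a short Python sketch. The function name `downtime_budget_minutes` and the 30-day window are assumptions for illustration; the talk only says "a month".

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window of `days` at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    window_minutes = days * 24 * 60
    return (1.0 - availability) * window_minutes


four_nines = downtime_budget_minutes(0.9999)  # roughly 4.3 minutes per 30 days
two_nines = downtime_budget_minutes(0.99)     # roughly 432 minutes, about 7.2 hours

print(f"99.99%: {four_nines:.1f} minutes")
print(f"99%:    {two_nines / 60:.1f} hours")
```

Under these assumptions the budget at two nines is 100 times the budget at four nines, which is the gap that lets the same staffing cover many more systems.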

00:18:37

Yeah.

00:18:38

So the process of vetting the application before SRE takes responsibility for it is important. This is a process that we go through at Google. We don't take for granted that any application running in production is always run by SRE. Typically, we actually have our developers build it and run it, and then eventually they decide which applications they're going to fund SRE teams for. So when they want to staff SRE on that system, the SRE team ramps up on that system using what we call a production readiness review. It's this review that often significantly increases the operability of a system. It's where we document all the knobs and buttons we can twiddle, and run through the checklist to make sure all the standard operational actions can be done and things have conformity and uniformity. Completing the production readiness review, or PRR, will confirm the operability of the system. The PRR is a way of making sure that, before the SRE team agrees to shoulder the ultimate responsibility for the success of the application, they're actually going to be able to provide a benefit once they do.

00:19:51

You can always add more failure domains. It might be expensive to duplicate critical infrastructure, but if the business sees value in it, it might be worth doing and worth doing well. I often get asked: how can I run a reliable system when all my critical dependencies are unreliable? The answer is very context dependent, but it's possible. By adding failure domains, duplicating entire systems, or using techniques like availability caches, you can make it so that even though the backend goes down, you've cached the data you need, or there's a second independent system that you can query, allowing you to keep your production system working.

Your systems might all feel immutable from the operations side today, but it doesn't have to be that way. You can enable your engineers. Enablement might come in various forms. It could be allowing them access and training to be able to do development alongside your software development teams. It might be bringing SRE into the design process. Introduce your SREs to product management and talk about how best your customers can be served, going all the way back to the product management level. This enablement is what's ultimately going to result in your SRE group really caring about the customer experience, and in tooling them up to defend that customer experience in the long term. And that's what SRE is all about. It's about aligning the operational aspects of running a system with what your customers actually want, which is a system that actually works and delivers value to them. It results in what your business wants, because your business wants to provide value.
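The availability cache mentioned in this segment can be pictured with a minimal Python sketch. The talk gives no implementation, so this is one possible shape under assumptions: the class name `AvailabilityCache`, the `fetch` callable, and the policy of serving the last known good value on any backend error are all hypothetical.

```python
from typing import Callable, Dict, Generic, Hashable, TypeVar

V = TypeVar("V")


class AvailabilityCache(Generic[V]):
    """Serve the last known good value when the backend call fails."""

    def __init__(self, fetch: Callable[[Hashable], V]) -> None:
        self._fetch = fetch
        self._last_good: Dict[Hashable, V] = {}

    def get(self, key: Hashable) -> V:
        try:
            value = self._fetch(key)
        except Exception:
            # Backend is unavailable: fall back to stale data if we have any.
            if key in self._last_good:
                return self._last_good[key]
            raise
        # Backend answered: remember the value for a future outage.
        self._last_good[key] = value
        return value


# Hypothetical backend that can be switched off to simulate an outage.
backend_state = {"up": True}


def flaky_backend(key: Hashable) -> str:
    if not backend_state["up"]:
        raise ConnectionError("backend is down")
    return f"value-for-{key}"


cache = AvailabilityCache(flaky_backend)
print(cache.get("user-1"))   # fetched from the backend and remembered
backend_state["up"] = False
print(cache.get("user-1"))   # served from the cache during the outage
```

The trade-off is that callers may see stale data during an outage, so this only suits data where staleness is acceptable to the business.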

00:21:27

So, in conclusion: assess your applications, SRE the one that matters most and is operable and mutable, and iterate. Thank you.

00:21:53

Now, I believe I have about five minutes for questions. If there are questions from the audience, I'm happy to repeat them. You do have a question.

00:22:03

Yeah. So if we would like to start, or try to convince our superiors, for example, that this is the way to go, what should we start with?

00:22:17

So if the question is, if I want to start SRE and I want to convince my company leadership, that this is the way to go,

00:22:26

What should I start with?

00:22:29

I think you should try and figure out where it will provide value in your business, because you should be able to look at some application, some system, and say: either we can make this more reliable or we can decrease the cost of running it, and over time this is literally how much better it will be. Try and figure out how you can put it in business terms, because if you can't figure out how it's going to provide value, then why do you need to do it? Yes, exactly.

00:23:02

Would you say start with one system, or, just thinking about scale, take one POC?

00:23:09

So the question is, would I start with just one system? I would start with potentially one system or one stack. You might say, I will take responsibility for our most critical application, plus the Kubernetes cluster, plus whatever dependency, and then subdivide. And I would iterate. You want to have people in place that can actually defend the customer experience, and if they're not empowered and enabled to actually defend it, both short-term and long-term, then they're applying the energy to the wrong place.

00:23:44

A follow-up question, in terms of SRE and development: in your experience, you know, taking on that kind of responsibility, how does that work from the operations side?

00:23:57

So the question is, in terms of taking responsibility for operations from a developer, how does that work in sort of a social aspect?

00:24:05

In terms of

00:24:07

The people buying in? My experience is that my developers are all extremely keen to never have to worry about operations ever again. In fact, I'll answer it a different way. Sometimes our developers get too reliant on us. And this is something that we sometimes talk about: we should leave our developers doing a little bit of running it themselves. So we make them partially responsible. We might put them on one on-call rotation a month, or we might make them do the releases, or half the releases, just to keep their finger in, because sometimes they lose visibility of the fact that there's a user at the end of the journey. We should all keep our eye on what actually matters, which is the users, and shielding our developers from that is actually where I see this going wrong. There was a question up there.

00:24:57

Yeah. So at what level of scale is it worth the investment of hiring SREs?

00:25:04

What level of scale? Now, I have to respond to that by saying my experience is at Google, where we have, for many years, had an incredible level of scale. But I've seen SRE done successfully in 40-to-50-person businesses. As you scale down, what you end up doing is applying practices, but not staffing. You would think about applying some of the things like error budgets and alignment of incentives around reliability, but not necessarily staffing up somebody to do that full-time. Maybe you need someone to be an SRE for two months, then go on to another project. Really, it comes down to: where is the value to you as a business? Can you justify it?

00:25:58

Yeah. What would you say if someone asks you what's the difference between an SRE and a DevOps specialist engineer?

00:26:05

And what is the difference between an SRE and a DevOps specialist engineer?

00:26:13

The real answer for me is that SRE has always been a job role. At Google, it was what happened when we took a group of software engineers and said, you know, your responsibility is to nothing but the production platform. The DevOps specialist engineers are applying themselves in the same space, but came from a different heritage. We had a parallel evolution between SRE and DevOps. I mean, it started in 2002 at Google, but we didn't really talk about it externally, and so the cross-pollination has only been recent. So we ended up with essentially different names for the same role and the same goals. There's a question over here. Oh, sorry, up the back first.

00:27:03

What would you say is the difference between an SRE and a CRE?

00:27:07

Oh, what's the difference between an SRE and a CRE? So I am a customer reliability engineer at Google, meaning I'm on a team of SREs called CRE, which causes all sorts of nomenclature issues. So I am genuinely an SRE, but the focus of my team is to look outwards at our customers who are using us as a platform, to enable them to have success on top of our platform. It's much the same way that you would have a team inside your company that runs Kubernetes as an SRE team, and then the team above that, which runs an application on that Kubernetes cluster. I'm in an SRE team, and I actually interface with other SRE teams at our customers. There was a question down here.

00:27:47

Um,

00:27:52

So if I got it right, you have SRE teams who run the applications and care about production, but are authorized to make changes to the source code in case of any problems or errors. And at the same time, you still have development teams that may develop new features and bring them into the same source repository. So how do you handle those conflicts between SRE teams changing the code while development teams, on their own, change the code for the next features?

00:28:27

It's very simple. If we're inside of the error budget, then everything is fine and we continue going. And if we run out of error budget, we go back to our development team and say, something's gone wrong; we've got to work together in order to make this more reliable. So it's a little bit reactive, in that we say the system is reliable enough, so let's remove all of these brakes, remove the friction, increase the velocity. And if the system is not reliable enough, if we're not meeting that criteria, then the SRE team is empowered to do something to address that. Sometimes that's saying to our development partners, hey, no features for a while; we've got to work on reliability and efficiency goals. Sometimes it's other techniques, but there's got to be some kind of control structure in place to align the incentives around reliability. I think that's all we have time for. I'm going to be in the speakers' lounge after this, if you want to have any follow-up questions without the audience. Thank you very much.