Las Vegas 2018

Getting Started with Site Reliability Engineering

Jian Ma, an SRE from Google, will talk about SRE key concepts and practices, and provide some insights into how to build an SRE team.


Jian Ma has been working for Google as a Site Reliability Engineer for more than ten years. He was the first Google Ads SRE and witnessed how the SRE concept started and progressed in Google. After that he worked as an SRE for Android systems, before moving on to his current position, CRE, or Customer Reliability Engineer. As a CRE, he works directly with customers on Google Cloud systems. He is also one of the authors of "The Site Reliability Workbook: Practical Ways to Implement SRE".

Jian Ma

Senior Site Reliability Engineer, Google

Transcript

00:00:05

Good afternoon, everyone. My name is Jian Ma. I am an SRE from Google. This afternoon I'm going to talk about some of the principles and practices used by Google SRE, and hopefully provide some insight into how to build an SRE team. So with that, let's go ahead. This is roughly what I'm going to talk about: introduction, service level objectives, error budget policy, making tomorrow better than today, the shared responsibility model, and summary. First, introduction. What is SRE? SRE is short for site reliability engineering. This is a concept, actually a way of optimizing ops work. It originated at Google in 2003. You can find different kinds of definitions. One way to define it is that it's a framework for operating large-scale systems reliably. The second one we typically paraphrase from Ben Treynor, the VP of engineering inside Google, whom many consider to be the founding father of this concept: SRE is what happens when you ask a software engineer to design an operations function. So let me emphasize the two key things here: software engineer and operations function. Inside Google, SRE owns running the system in production at every operational level, or, put another way, we own the production system.

00:01:55

And Google SRE published two books on this topic. The one on the left is the Site Reliability Engineering book we published in 2017, which was pretty much a hit. The one on the right we just published this year, in July. I think this slide is a little bit out of date. We published in July. This is the companion book to the first one. On top of the concepts, it also provides examples of how to get things done, how to organize. So who am I? My name is Jian Ma. I spent 14 years working as a Google SRE. I started at Google as a Google Ads SRE, continued on to Android, and am now in the team called CRE. Actually, I was the first Google Ads SRE, back many, many years ago. At that time, Google was much smaller than now. As a benefit of that, I actually witnessed the SRE concept, how it started and progressed in Google.

00:03:04

You know, the ups and downs, the corrections we made, this kind of thing. CRE is the team I'm working on now, and that's the reason why I'm here to talk about this. It is short for customer reliability engineering. This is Google's way of basically trying to push the SRE concept into the industry, to help Google Cloud customers operate large-scale systems in a reliable way. And also I'm an author of The Site Reliability Workbook, the one we published this year. So today I want to talk about three principles of how Google operates site reliability. Number one, SRE needs service level objectives with consequences. Number two, SREs have time to make tomorrow better than today. And number three, SRE teams have the ability to regulate our workload. What does it mean? First, let me start with the service level objective, which is not a new term for many, but it's the key concept for us.

00:04:19

Let me describe why it's so important for Google SRE. A service level objective, or SLO for short, is a goal to measure how the system behaves. On top of that, it specifically tries to measure the customer experience. In plain English, it basically means that if customers are happy, roughly speaking, the SLO goal has been met. Typically you can define it in many different ways, and I can give some quick examples here before we discuss the inside of this thing. The first two are about availability. Uptime of 99.9% a month, or three nines: if you do the math, it basically means that in a month you could have 43 minutes of downtime. This math is quite simple. The second one is the 200 OK ratio. You can say four nines in a month; it will give you basically three and a half minutes of non-200 responses. The third one is latency: 50th percentile under 300. And the fourth one is another example, about log processing: 99% of the log requests processed within five minutes. This is typical when you have a pipeline-style thing: you have the transactions, you have the backlog, you're processing, this kind of thing.
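The downtime arithmetic behind the availability examples above can be sketched as follows. This is a minimal illustration, not anything shown in the talk: the function name and the 30-day month are assumptions. Under that assumption three nines allows about 43.2 minutes, matching the figure quoted, and four nines allows about 4.3 minutes.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based availability SLO permits.

    Assumes a fixed window of `window_days` days (30 here, an assumption).
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

The same function works for other windows, for example `allowed_downtime_minutes(99.9, window_days=7)` for a weekly view.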

00:05:59

Okay? So we just talked about the examples. So what's the difference between SLO and SLA? SLA stands for service level agreement. So here's the difference. An SLA, typically, from our experience, is defined as part of the contract between two different companies. There are financial consequences, penalties and other things, if it is not met. Maybe because of that, what quite often happens is that the customer's experience cannot be sufficiently expressed by the SLA. That's why the SLO concentrates on user experience. What next? Now we have a number, we have monitoring, you have that. Then what's the consequence? Because without a consequence, as we just mentioned, there is no financial penalty. So what's the consequence? Here is the second concept I want to talk about today. Inside Google SRE, we use this term: error budget policy. This is what we found out. If you ask anyone, "How reliable do you want your system to be?", the answer quite often is "the more the better". But everybody knows, especially the crowd here knows, that 100% reliability is expensive. You have to make a sacrifice on your development velocity and engineering time.

00:07:55

So what do we do? We introduce the error budget. The error budget is basically the gap between perfect reliability, 100%, and what we defined early on, three nines or four nines. The budget is to be spent. In plain English, this is what we do. Okay, we just had an incident, we had 20 minutes of downtime. Let's assume we have a system targeting three-nines reliability. We had 20 minutes of downtime; we can say without guilty feelings that this month we still have 23 minutes in our error budget, because we have 43 minutes for three-nines reliability to spend for the month. So what's the policy here? I give an example of how we define it. Different teams on different services inside Google define this differently. These are just examples, to give you a taste. What we want is a visible improvement in reliability. That's the goal. An example here is that, as SRE, we can stand up and tell all the counterparts inside the company, feature developers, infrastructure, and many others, that no new feature launches are allowed. Essentially it means that we have the power to say feature freeze.
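The error budget bookkeeping in the example above (a three-nines target, 20 minutes of downtime, roughly 23 minutes left) can be sketched like this. It is a hedged illustration only: the `ErrorBudget` class, its method names, and the 30-day window are assumptions, and the freeze rule shown is just one possible policy of the kind the talk says differs from team to team.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Time-based error budget for an availability SLO (hypothetical helper)."""

    slo_percent: float
    window_days: int = 30  # assumption: a 30-day month

    @property
    def total_minutes(self) -> float:
        # The budget is the gap between 100% and the SLO target.
        return self.window_days * 24 * 60 * (100.0 - self.slo_percent) / 100.0

    def remaining_minutes(self, downtime_minutes: float) -> float:
        return self.total_minutes - downtime_minutes

    def feature_freeze(self, downtime_minutes: float) -> bool:
        # Example policy: freeze feature launches once the budget is spent.
        return self.remaining_minutes(downtime_minutes) <= 0


budget = ErrorBudget(slo_percent=99.9)
print(f"remaining after a 20-minute incident: {budget.remaining_minutes(20):.1f} min")
print(f"feature freeze: {budget.feature_freeze(20)}")
```

With the numbers from the talk the remaining budget is about 23 minutes and no freeze is triggered; a second incident that pushes total downtime past roughly 43 minutes would flip `feature_freeze` to `True`.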

00:09:39

We can also say that the team, whether it is a feature development team or an SRE team, during this feature freeze time period, takes its action items only from the postmortem action items. And we could also say that we want to have a daily meeting, them with us, to discuss what improvements we can make. So let me summarize principle number one: we SREs demand and define the SLO, an SLO with consequences, as the first thing. It also means that any organization, even without hiring a single SRE, can have the same error budget policy. This is just an idea. You can implement this today by starting to measure, account, and act. Okay, so now let's dive a little bit deeper into why we insist on this. As SRE, we want to make tomorrow better than today. SLOs, SLIs, and the error budget are only the first step. The next step is staffing the SRE role. The SRE role should have real responsibility. It's not just advisory or anything; real responsibility. So that's what we found when we built up new SRE teams inside Google: we define and refine the service level objective as the number one task.

00:11:24

This person, or these several people, are in a position to evaluate and sound the alarm that the SLO is not met, that customers are experiencing pain, and that we want some action to be taken. I just gave the examples, the freeze and other things.

00:11:47

So here is what we consider toil. I think this might be a little bit surprising for many. Toil is a negative word in our world. In our world, it covers things like going on call, doing firefighting, doing incident management, from the exciting part of the thing to the not-so-exciting part of the thing: capacity planning and, for example, as part of a release, checking dashboards here and there and looking for the success or failure of the canary before it goes out everywhere. All of this, inside the Google SRE circle, is considered to be toil, and toil is negative. As such, it's a general practice inside Google SRE: we want this part of the work to be no more than 50% of our time.
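One way to check the 50% cap described above is to tally hours by activity and compare the toil share against the cap. The sketch below is purely illustrative: the activity names, the hours, and the `toil_fraction` helper are assumptions, not a tool mentioned in the talk.

```python
TOIL_CAP = 0.5  # the "no more than 50% of our time" guideline

# Hypothetical labels for the kinds of work the talk counts as toil.
TOIL_ACTIVITIES = {"on_call", "incident_management", "capacity_planning", "release_checks"}


def toil_fraction(hours_by_activity: dict[str, float]) -> float:
    """Share of total logged hours that went to toil activities."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_activity.items() if name in TOIL_ACTIVITIES)
    return toil / total


# Made-up week for one engineer.
week = {
    "on_call": 10,
    "incident_management": 6,
    "release_checks": 6,
    "design_consulting": 8,
    "monitoring_code": 10,
}
share = toil_fraction(week)
print(f"toil share: {share:.0%}, over cap: {share > TOIL_CAP}")
```

In this made-up week 22 of 40 hours are toil, so the share is 55% and the cap is exceeded.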

00:12:55

Among the many, many things we wanted to do, and succeeded or failed to do, this is one that belongs to the group we have done quite successfully. We review it. If the team finds out that we spend more than 50% of our time working on all the toil I just described, it's not good. So what is not toil? What is not toil is project work. What do we consider project work? The list is quite long; let me give you some examples. The first thing we do a lot is consulting on system architecture design. It basically means that a PM and a development team want to start a new service. They have a design doc, they have reviews. SRE gets quite actively involved even at this stage. We go there and tell them, from our experience, in order to operate or design a high-reliability system, which route is better and which is not. Essentially, everybody knows that during the design phase everything is a trade-off, so we provide the SRE perspective on the trade-offs. This is one part of that. We also do authoring and iterating on monitoring. There is actually a lot of coding in this area; my own personal engineering efforts are in this area.

00:14:29

One of the projects I finished last year is that four of us SREs wrote a system to process time series anomaly detection for 1.1 billion time series in real time. So that's the kind of thing we do. We also do automation, automating all of the repetitive work. And the last one is actually also quite important: writing postmortems. What we found out is that writing a postmortem listing a whole bunch of action items is not difficult, but that's only the first step, because this long list of action items could belong to the feature developer team, or the SRE team, or, quite often, have no clear owner. It just doesn't work. SRE quite typically takes the role of coordinating the implementation of these things, essentially becoming some kind of PM, but with passion, because we got paged, we did the firefighting, we got excited, and now we want to see the thing fixed. This is the full circle. Next page, next slide. Oh, sorry. Okay, yeah. So let me summarize what we talked about. Principle number two: SREs have time to make tomorrow better than today. To make this very clear: we are not there to take the operational load. We are there to take pride in making tomorrow better than today.

00:16:31

So this is the third topic I want to talk about today: the shared responsibility model. Here is how we do it. This is one thing that, from what I have heard so far of how many other companies do things, is quite unique. Google is a very big company and has a lot of services. If you count the number of projects, or the developers working on a certain project or service, the majority of them have no SRE support. Put another way, SRE only supports a minority of the services Google provides. This is counting the number of projects, which in a way represents how many developers are there. However, of course, if you look at it another way, by QPS, users, revenue, the majority are supported by SRE. What does it mean? It means that we do not, by default, take on a production system. By default, we do not. They have to work together with our SREs to pass a certain bar

00:17:46

in order to get our support. The bars include, of course, the obvious one: you have enough users, your service means something. But it also means that your system has to be reliable enough. You follow SRE practices, you listen to us, we all work together to get it reliable. And also, almost as important, the management team, the executive team of that service, buys in. The word buy-in is not new; as you have heard, many teams inside different companies do this. In our world, the buy-in is real. If you don't do anything, by default the feature development team, your executives, your whole team are in charge of the service. We can help, we can do the active design and consulting, all of these things, but we don't take it. In order for us to take it, the Google way is that SRE is a totally different management structure. However, our headcount is funded by the service owners, which means that there is a real financial decision for the top executives to make in order to get SRE support for the service.

00:19:14

This way they have to think about it, and after that, as we said, they have to pass certain bars, reliability and other bars, before we take it over. But it also means that we can control our workload. If we are overloaded, there's no way for us to write code to improve, to do the project work to make tomorrow better than today. So this way we can control it, we can regulate our workload. There are some examples; different teams do it in different ways. For example, in this case, we can give 5% of the operational work back to the developer teams. Here we are talking about a mature system, <inaudible> about the transition period; this is the mature ones, like Ads, this kind of thing. We give them a little taste of what on-call shifts, rollout management, and ops tasks are like. What we found out is that this is actually quite useful to get them to understand our principles and our operating model, and to get them motivated to react.

00:20:26

As for what I called project work, just as I described earlier, we are software engineers, most of us. Our projects have a design doc, have everything. It's just real project work. Actually, typical practice inside Google is that there is no boundary on what SRE really does. Let's say you are very passionate about Java garbage collection tuning, which many consider to be an art nowadays. Typically, inside Google, if you're looking for the best Java garbage collection tuning experts, they are SREs. Not everyone wants to do this kind of thing, but it gives you a flavor of what SREs do. There is no boundary. We are real software engineers. We do a whole bunch of things.

00:21:23

We only onboard them if they can be operated safely. And let me explain the last part a little bit: if every problem with the system has to be escalated to its developers, give the pager to the developers instead. What it means is this. Quite often, what we find is that during the early stage of the transition, when the service goes from the developers to SRE, the service is so immature that not only do we not quite understand what's going on, even within the developer team, the different sub-teams don't know what the other sub-teams are doing. So we have to go back and ask whoever wrote this specific feature what's wrong. If we see this happen several times, Google SRE's general practice is that we'll just send the alert back to whoever wrote it.

00:22:16

We'll say that this is not the best way to utilize our time and your time. You have to make it more reliable and more uniform. We'll help you, but this is your responsibility. Until you do that, I'm sorry, but this is your pager. Leadership buy-in, as I mentioned earlier: for the last part, I'll provide one example of how the whole thing ties together. When we run out of the error budget, we tell the leadership of the development team or service team that you have to put your developers on this reliability work, system reliability work. Or, because everything is the budget, everything is the math, you can loosen the SLO. Let's go from four nines to three and a half nines, or three nines, or two and a half nines. Typically at this stage the service owner team, the business owner of this service, will be quite nervous, because this is a real number for them to see dropping, going down. They say, "We were a four-nines system; now we are three nines." But we make it clear that this is one option, and this way we can make them understand that it's much better to consider reliability early on in the whole life cycle of the service, rather than finishing the design and throwing it over the wall for the operations team to deal with, which in this case is SRE.
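To make the "loosen the SLO" option concrete, the sketch below shows how much monthly budget each of the targets mentioned above would buy. It rests on two assumptions that are not stated in the talk: a 30-day window, and the common convention that a "half nine" appends a 5 (so three and a half nines means 99.95%). The function names are hypothetical.

```python
def unavailability(nines: float) -> float:
    """Allowed failure fraction for a target expressed in nines (3, 3.5, 4, ...).

    Assumption: a "half nine" appends a 5, so 3.5 nines means 99.95%.
    """
    whole = int(nines)
    fraction = 10.0 ** (-whole)
    if nines - whole >= 0.5:
        fraction *= 0.5
    return fraction


def budget_minutes(nines: float, window_days: int = 30) -> float:
    """Error budget in minutes for the window (30 days assumed)."""
    return window_days * 24 * 60 * unavailability(nines)


for target in (4, 3.5, 3, 2.5):
    print(f"{target} nines: {budget_minutes(target):.1f} minutes per 30 days")
```

Each step down buys a visibly larger budget (about 4.3, 21.6, 43.2, and 216 minutes), which is the number the service owner has to accept publicly if the reliability work is not staffed.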

00:23:56

Automation is also something we do a lot: we eliminate toil, do planning, and fix issues automatically. On the last item, fixing issues automatically, the internal saying is this: if you can write the fix in a playbook, in a process, in your documentation, you can make the computer do it. Essentially it means writing code to fix the system automatically. Let me be honest: even in programmer circles, from time to time we push back. We say, "It's so complex. Rather than writing code to do this, it's much easier if I write documentation, and next time you read it." I'm talking about among ourselves, not with outsiders. But still, because the background of the majority of Google SREs is software engineering, programming, what you will quite often hear is people just saying, "Okay, let me show you how to do it," and that ends the discussion. So let me summarize the third principle: SRE teams have the ability to regulate our workload so that we can spend time working on projects to make tomorrow better. And in order to have that time, all the time, you have to realize our goal is to make customers happy.
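The saying "if you can write the fix in a playbook, you can make the computer do it" can be illustrated with a small dispatch table that maps an alert to the code form of its playbook entry. Everything here is hypothetical: the alert name, the alert fields, and the scaling rule are invented for illustration and do not describe any Google system.

```python
from typing import Callable, Dict

Remediation = Callable[[dict], str]
PLAYBOOK: Dict[str, Remediation] = {}


def remediation(alert_name: str) -> Callable[[Remediation], Remediation]:
    """Register a function as the automated fix for one alert type."""
    def register(fn: Remediation) -> Remediation:
        PLAYBOOK[alert_name] = fn
        return fn
    return register


@remediation("queue_backlog_high")  # hypothetical alert name
def add_workers(alert: dict) -> str:
    # Playbook text this replaces: "If the backlog is high, add one worker
    # per `per_worker_capacity` items of backlog (at least one)."
    extra = max(1, alert["backlog"] // alert["per_worker_capacity"])
    return f"add {extra} workers"


def handle(alert: dict) -> str:
    """Run the automated fix if one exists, otherwise fall back to a person."""
    fix = PLAYBOOK.get(alert["name"])
    if fix is None:
        return "page a human"
    return fix(alert)


print(handle({"name": "queue_backlog_high", "backlog": 2500, "per_worker_capacity": 1000}))
print(handle({"name": "disk_latency_high"}))
```

The design choice is that every playbook entry is either code in the table or an explicit hand-off to a person, so the gap between documentation and automation stays visible.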

00:25:19

How happy are the customers? It is a number, it is math, and the math is calculated by the SLO. So that's the summary for the third one. And that's the summary of what I have been talking about for these 20 minutes. SRE needs SLOs with consequences. SREs have time to make tomorrow better than today. And SRE teams have the ability to regulate our workload in different ways, including pushback. Thank you.