Las Vegas 2023

Ten Things We've Learned From Running Production at Google

10 key lessons that Google has learned from 20 years of SRE: How to establish a reliability culture, fight toil, and manage change.


Christof Leng

SRE Engagements Engineering Lead, Google





So over the years here at the DevOps Enterprise Summit, we've had so many talks on site reliability engineering. There are so many ideas that come from SRE principles and practices, which Google pioneered in 2003. I think it's one of the most incredible examples of how one can actually create a self-balancing system, helping product teams get features to market quickly, but in a way that doesn't jeopardize the reliability and the correctness of the services they create. I'm so grateful for the next speaker, because for over a decade I've wanted to understand why Google chose a functional orientation for its site reliability engineers. To this day, thousands of Google SREs are still in one organization reporting to Ben Treynor Sloss, VP of 24x7 Engineering, which includes SRE, very purposefully outside of the product organizations. So the next speaker is Dr.


Christof Leng, SRE Engagements engineering lead. Over the years, he has managed and worked on various parts of Google services, including Cloud, Ads, and internal developer tooling. I've learned so much from him and one of his colleagues, Dr. Jennifer Petoff, about how Google SRE leadership interacts with the dev leadership in this functional orientation. He's spoken at this conference three times, and this year he will share some of the key lessons that Google has learned from 20 years of SRE: how to establish a culture that enables reliability, fights toil, and manages change. Here's Christof.


Thank you.


Thank you, Gene. Thanks for having me back. Hello, everyone. Today I want to talk to you about ten things that we've learned over the years from running production infrastructure at Google. As Gene already said, SRE is something that was started by Ben Treynor Sloss at Google pretty much exactly 20 years ago; we are celebrating the 20th birthday right this month. The idea behind it is taking a software engineering mindset and methodology to design and run operations. It started at Google, but it's an industry-wide practice now. Over the many years we've collected proverbs that describe best practices and common pitfalls that we have encountered. Some of them originate from Google, others we have adopted, only that we call them prodverbs. Now, SRE at Google is thousands of engineers in one organization that work on basically every major Google product, so we have many, many years of experience in very different areas. Google products can be very large or relatively small, very fast moving or very stable. I think these topics apply widely, but you shouldn't just copy-paste them to whatever you are doing, because every environment, every organization is unique, and that won't get you very far. You should use them as food for thought. I hope they are useful.


Now, the first of the three sections is about culture. As they famously say, culture eats strategy for breakfast. And I believe that, much like in the DevOps movement, culture has been essential to SRE's success. So let me start with the first principle: reliability can't be taken for granted. It's a little bit like the basic things in day-to-day life, air, food, and so on. It's easy to forget that they even exist, that you need them, while you have plenty. But when you run out of them, it can be very existential, and it can be very hard to get back to normality. Because when you run out of reliability, not one thing has gone wrong, many things have gone wrong, and you need to fix a lot of them to get your systems and products reliable again. That's why there always needs to be a voice for reliability at the table. That is the role that SRE strives to fulfill, and it's why our official motto is "hope is not a strategy"; we need something better than that. Number two is the metaphor of cattle and pets. Your systems should not be pets, your systems should be cattle. Pets are unique: they have names, they have personalities, and you invest a lot of time, energy, and money into maintaining them, because you care deeply about each individual one of them, and each individual one of them is different. Whereas for cattle, you care about the herd, not the individual. They're uniform, they typically have numbers instead of names, and individually they're cheaper.


So this is not about actual animals, this is about machines. I don't endorse industrial farming, but luckily machines are not living beings. So we should standardize them, and we should scale them. That makes it a lot easier for us to run large systems, run them efficiently at low cost, and it also allows us to change them relatively fast and easily. And don't forget: cognitive load is an important bottleneck.
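The cattle mindset can be made concrete in code. Here is a minimal sketch, assuming a hypothetical fleet where every machine's configuration is stamped from one shared template; the names `TEMPLATE` and `render_config` are illustrative, not any real Google tooling:

```python
# Cattle, not pets: every machine's config is rendered from one template.
# TEMPLATE and render_config are illustrative names, not real tooling.

TEMPLATE = {
    "os_image": "base-image-v42",
    "monitoring": True,
    "log_level": "info",
}

def render_config(machine_id: int) -> dict:
    """Render one machine's config; only the numeric ID differs."""
    cfg = dict(TEMPLATE)
    cfg["hostname"] = f"web-{machine_id:04d}"  # numbers, not names
    return cfg

# Any machine can be destroyed and recreated identically at any time.
fleet = [render_config(i) for i in range(5)]
```

Because nothing is hand-edited per machine, replacing machine 3 means calling `render_config(3)` again rather than remembering its quirks.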


Number three: you've probably all heard about blameless postmortems. Let me talk a little bit about the why. Why do we care about blamelessness? A lot of people think it's just about being nice to each other. And it's true, yes it is, but that's not all of it. There's more to it. When you create an environment where people are afraid to speak up, to admit mistakes that they have made, to flag when there is a problem, you will not get the full information about the weaknesses in your infrastructure and your systems. And then you can't fix those; you can't do good risk management. So you need people to be willing to speak up without fear of consequences. And finger-pointing doesn't help with anything anyway, because you won't be able to fix people. You need to fix the systems and the processes. So if there's this big red button that Christof pushed that took down the whole system, the question is not "why is Christof so stupid?" Believe me, <inaudible>, I like big red buttons. <laugh> The question is: why do we even have that button in the first place, and why is it so easy for Christof to press it without thinking too much?


Number four: measuring. I don't have to tell this community about the importance of metrics; we've talked about it at length over the last few days. But metrics also have risks. When you put out a metric and say this is important to the business, this is important to each individual's career, people will start optimizing towards that metric. So that metric had better actually align with business outcomes. Often we use proxy metrics, and it can be very misleading if people optimize for them instead of the actual business outcomes. And when you don't measure something, it typically gets worse, because people are not paying as much attention; they're very focused on improving that other metric that will hopefully get them a promotion. So you should be very careful about what you measure and what you don't measure, because you can't measure everything. If everything is important, nothing is. And then always iterate on your metrics. There isn't one set of metrics that will always be true for everybody everywhere; it really needs to align with your business. And then apply context, and don't just follow the numbers. So, the second section is about operations.
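One concrete way SRE ties a reliability metric to business outcomes is an SLO with an error budget: pick a target, then track how much failure you can still afford. A minimal sketch with made-up numbers, no real service or API implied:

```python
# Error-budget sketch: illustrative numbers, not a real service.

SLO_TARGET = 0.999  # 99.9% of requests should succeed this period

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the period's error budget still unspent (negative = blown)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    return 1 - failed_requests / allowed_failures

# 10M requests and 4,000 failures: the budget allows 10,000 failures,
# so about 60% of the budget remains and risky changes can still ship.
remaining = error_budget_remaining(10_000_000, 4_000)
```

When the remaining budget hits zero, the usual practice is to slow down launches and spend the time on reliability work instead, which keeps the metric aligned with what the business actually needs.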


SRE is not an operations role, but operations plays an important part in our role. So let me explain. First of all, I like to say that the only way to really understand the limitations of a system is to see it go up in flames with live user traffic. And you have to be there. Don't just read the postmortem; be part of that experience. In the situation itself, you will have so many more opportunities to dig a little bit deeper and understand what has happened, and only with that information can you really improve the system. But that is the goal, not the on-call; the on-call is a means to an end. And there's another thing to it: when you are on call, when you are working together with the developers on these systems, you have skin in the game. They trust you because you are in the same boat, and then they're more likely to listen to your advice than if you stand on the sidelines and make smart comments. But when you're on call, don't be a hero. We don't need heroes; they're actually very bad. Heroism is not only bad for the hero themselves. If you look at literature, heroes tend to have a rather short life expectancy.


It's not good for your health, mental or physical. It's also not good for the team, because it creates a certain culture that is not very sustainable. If you say, "look, Christof was so great, he stayed all weekend to fix the system, and the customers are so happy," everybody else in the team thinks, "oh, I should also do that." Don't applaud people for that. I mean, sometimes we do need to go the extra mile, but you should be very careful not to create a culture where this is expected.




It's also bad for your products and your systems. If the best part of the job is extinguishing fires, then it's boring when things never catch fire, right? People won't invest much time into improving the systems early on, because they're very busy fighting fires. And also, on-callers should never be alone. The worst thing that can happen to you is getting paged at 3:00 AM, nobody's around, you have no idea what this alert means, you have no idea what to do, you're stuck, and you don't know how to escalate or whom to escalate to. There's no way forward. I never want to be in that situation, and I don't want any of our engineers to be in that situation either. So there should always be people around in one way or another, and there should be clear escalation paths. On-call should never be alone. It leads to faster mitigation, it leads to better insights into how to improve the system, and it makes on-call a lot less scary. We also say you should automate yourself out of your current set of tasks about every 18 months. And the reason for that is: if you keep doing the same things that you have been doing,


you will always get extra work: new systems, new things, increasing toil from increasing user traffic. You will drown in toil; you will only do this repetitive work, and you won't have any time for engineering and for improving the systems. So simply to stay in the same situation, where you can actually have some time for engineering, you need to aggressively automate. Most people think that automation is first and foremost about efficiency. I would argue it's more about consistency, because the automated script always does the same thing. If you ask me to turn up five clusters by hand, not one of them will look exactly like the others. I will be very, very careful, but I might overlook something here, forget something there. That command couldn't run there, so I tweaked it a little bit. So every one of them will be slightly different, and they will become pets. And you don't want pets, because these things will blow up later, when nobody expected that this one cluster had this flag set differently. The third section is about change. Change is super important. Change is happening all of the time, in our organizations, in the world around us, and in our software systems. Software is fast moving.


So how does change impact how we run production systems? First of all, it breaks them. Change is the number one reason for outages. So should we stop changing software? Quick show of hands.


A few.


So probably not. It would make life easier, right? But also not very enjoyable. We love changing things, and it makes our products better and makes our users happier. There is some inherent risk to change; it's a risk that we're willing to take. But there's also accidental risk from poor change management, and that is something that we can minimize. So first of all, you shouldn't just roll out a feature globally and see what happens. I can tell you what will happen. Use incremental rollouts and test at every single step: test in non-prod, test with a small percentage of users, test with a small percentage of the regions, and so on. Do not deploy a config that hasn't been submitted to a code repository and been code reviewed. If you don't do that, if you just run the config from your command line, you don't actually know what's deployed in production. And I've been debugging outages for way too long, only to find out that somebody on the team had pushed this one flag from their workstation. And don't deploy on Fridays, for whatever Friday is in your part of the world. Not on weekends, not on holidays when nobody is around. Because as long as your rollouts are not great, and nobody's rollouts are really great,


there won't be anyone around to fix that. Do it during business hours. It's much more convenient, believe me; you'll have coffee and everything. But if you ever get to a state where rollouts just never break, roll out all the time, obviously. And tell me how you did it. Number nine: outages will happen. There's an inherent risk to change, and it will cause outages, and other things cause outages too: a hurricane, an earthquake, who knows. That's okay. The idea is not to prevent outages, it is to minimize their impact. Minimize their blast radius: how many users are impacted, how many regions, how many of your products. And reduce the time to mitigation, until the system is up and running again. So root-causing is super important, but it can wait. Collect all of the data during the outage, and then analyze later. First the system needs to be up and running again. Be able to roll back things quickly, and then analyze why the release didn't work. And to be able to support your postmortems and your root-causing, use written communication during the incident as much as possible,


because then you will have a paper trail of what actually happened, and it will be easier for others to join you, help you, and read up on what has already been tried. And give everybody access to the source code, so they can actually figure out what happened in the code that caused this outage.
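The incremental rollout advice from this section can be sketched as a loop: widen the blast radius stage by stage, check health, and roll back the moment a check fails. Everything here is hypothetical; `check_health` stands in for real monitoring, and no actual deployment system is called:

```python
# Incremental rollout sketch: hypothetical stages and health check,
# standing in for a real deployment and monitoring system.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of users on the new version

def rollout(version: str, check_health) -> str:
    """Widen the rollout stage by stage; stop and roll back on failure."""
    for fraction in STAGES:
        # A real system would shift `fraction` of traffic here,
        # then watch its SLO metrics before continuing.
        if not check_health(version, fraction):
            return f"{version} rolled back at {fraction:.0%}"
    return f"{version} fully deployed"

# A healthy release reaches everyone; a bad one stops at the 1% canary,
# which is exactly the small blast radius we wanted.
good = rollout("v2", lambda version, fraction: True)
bad = rollout("v3", lambda version, fraction: False)
```

The design choice worth noting is that rollback is the default path out of the loop: the release only proceeds when each stage affirmatively passes its check.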


Last but not least: no haunted graveyards. If you don't manage your technical debt actively, it will get worse, and you might reach a point of no return where nobody wants to go near that code anymore. Have you ever seen something like "don't touch this code, very important"? <laugh> Yeah, I love these. They're the first thing I clean up in the code base. What happens if I remove that? Okay, interesting. Hmm. Because these things are booby traps for change. If you have many of these things in your code base, just unmaintained, and nobody wants to go near them anymore, they will trigger when you change something else, and they will blow up in your face at the worst possible moment. So clean them up early on.


So let me summarize. What did we learn? First of all, running production systems is a team sport. Across silos, build relationships and work together, or you will fail and have a horrible time. Second, there is change. Your systems need to change, your products need to change, your organization needs to change. You need to keep changing; that is good, that is healthy. But don't do it by working hard, do it by working smart. Do it through engineering; we are engineers. And try to keep things simple. Everybody can build a complex system. That's not hard, that's not something to applaud. Try to build boring systems. Thank you so much. My collection of topics is definitely not complete, so the help that I'm looking for is this: what are typical principles and proverbs that you know, that you have learned, and that you might want to share with me? I will be at the Google booth, downstairs, upstairs, somewhere, later today. Thank you so much.