Las Vegas 2022

ProdEx - Google's Production Excellence Program

ProdEx is Google SRE's flagship program for production health and operational risk. It has been running for 7+ years across the SRE organization. SRE directors assess the health of SRE teams and provide coaching in an interactive review setting based on production metrics and business context.

Christof Leng

SRE Engineering Lead, Google

Transcript

00:00:17

So this week you'll see two prominent themes in the talks. The first is definitely site reliability engineering. The second is about how we organize to best achieve the outcomes that we want. So in many ways the next talk is about both. There are so many interesting things about site reliability engineering principles and practices, which Google pioneered all the way back in 2003. I think it's one of the most incredible examples of how one creates a self-balancing system that helps teams go to market quickly without jeopardizing the reliability and correctness of the services they create. So for over a decade, I've wanted to better understand why Google chose a functional orientation for their SREs. To this day, thousands of SREs at Google are in one organization reporting to Ben Treynor Sloss, VP of 24x7 engineering, which includes SRE very purposely outside of the product organizations.

00:01:08

So the next speaker is Dr. Christof Leng, SRE Engagements Engineering Lead. Over the years he's managed and worked on various parts of Google's services, including Cloud, Ads, and internal developer tooling. And over the pandemic, he spoke with Dr. Jennifer Petoff, Global SRE Director of Education. I learned so much from him personally about how Google SRE leadership interacts with dev leadership in this sort of functional organization. And so he's presenting today on how he's helping his organization ensure that they all have excellent SREs who can help their customers succeed, and how SRE fosters production health across Google's fleet of services. Here's Christof.

00:01:52

Thanks for having me. Hello, everyone. Thank you. I'm going to talk about ProdEx, Google's production excellence review program, which helps us to manage operational risk and promote best practices across Google SRE. But why are we doing it? To explain that, I have to talk a little bit about organizational structure first.

00:02:32

So, SRE at Google is a central specialist organization with its own reporting hierarchy and organizational structure, but it is matrixed into the individual product areas, the business units, to support them and align with them, much like a lot of other specialist roles at Google. Now, this is the classic matrix model, and it comes with challenges. You need to align on the vertical, on the business alignment. The SREs in the individual teams need to understand the systems and the business needs in that area. But SRE is also a community, a community that learns from each other, that builds platforms together, that establishes standards and promotes best practices. So we learn from each other, and to reach our full potential we need to encourage that exchange across SRE, which is a very large organization by itself.

00:03:50

And there is a lot to say about the business alignment, about the system reviews, and I hope I can talk about this another time. But today I'm going to talk about this other dimension: the horizontal alignment, the promotion of best practices, standards, and mutual understanding. Now, the goal of the ProdEx program is to drive these operational best practices, which are constantly evolving, and production health across all of SRE, and by extension across all of Google's products. The idea here is to assess the main risk areas for the SRE-owned production services, to see what we can do about them, how we can manage them, how we can mitigate them, and especially to identify individual SRE teams that may need more help. There are always some hotspots, there are challenges, and it's a very dynamic situation.

00:04:57

But it is not an audit. It's not a compliance exercise. It's a coaching opportunity for these SRE teams to learn from more senior leaders and to better understand how they're doing and what they're doing. But it's also an opportunity for the reviewers to get cross-SRE visibility and awareness and to bubble that up to SRE leadership to inform the overall SRE strategy. What it is not is the business alignment, the vertical thing, nor compliance, saying here are the policies that you need to follow. It is not there to criticize; it is there to help and to add perspective.

00:05:51

So how do we do that? First of all, a lot of these review programs, which probably all of you do in one form or another, are kind of unstructured. Everybody has their own slide template, these evolve, and every team does it a little bit differently. So these things are not repeatable. Everybody's using a slightly different set of signals, and that really prevents us from using the reviews as a data source to bubble information up. It's also a lot of overhead and preparation for the individual teams. It's not uncommon that a team spends multiple days or even more than a week to prepare such an unstructured review. And we don't want that. We want shared metrics, and we built dedicated tooling that actually automates the data collection. You can still tweak it if our automation missed some of the systems that you're responsible for.

00:06:58

And then we also want to apply context from the team. We are a data-driven company. Data is extremely important, objectively measurable data especially, but the context, the perspective of the team, and the business context also matter. So there's plenty of room for them to provide speaker notes, to annotate, to explain why the data looks the way it does. And then there are two senior reviewers per review session. They're typically directors or principal engineers, and they review the findings together with the team. And important here: they do not typically come from the same area. They're not from the reporting chain, they're not the bosses. They are senior leaders from elsewhere in the organization, so they can provide an outside perspective and it's less intimidating. And generally we aim for every team to get reviewed at least once per year. But if we see a low score, if we see a lot of risks in a team, the team gets automatically scheduled more often.
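To make that cadence a bit more concrete, here is a minimal sketch of the scheduling rule he describes, with hypothetical thresholds and intervals; the 0.5 cut-off and the 120-day follow-up window are assumptions for illustration, not Google's actual ProdEx tooling.

```python
from datetime import date, timedelta

# Sketch of the review cadence: every team at least once per year,
# low-scoring teams scheduled again sooner. Values are assumptions.
LOW_SCORE_THRESHOLD = 0.5                 # assumed cut-off for an "at risk" team
DEFAULT_INTERVAL = timedelta(days=365)    # at least one review per year
FOLLOW_UP_INTERVAL = timedelta(days=120)  # assumed faster cadence for at-risk teams


def next_review_date(last_review: date, overall_score: float) -> date:
    """Pick the next review date from the last review date and its overall score."""
    if overall_score < LOW_SCORE_THRESHOLD:
        return last_review + FOLLOW_UP_INTERVAL
    return last_review + DEFAULT_INTERVAL


# A team that scored poorly in March gets scheduled again the same year.
print(next_review_date(date(2022, 3, 1), overall_score=0.4))  # 2022-06-29
```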

00:08:18

So we keep tabs on those and try to help them and make sure that they dig themselves out of that hole. And talking about all of these things is nice, but the review itself is only the starting point. The real goal is to identify actions, to track these actions, and to make sure that things actually change, that things improve. So that's an inherent part of the program: to generate action items in areas where we see the need for improvement, and then in the next review session to go over these together with the team and review the progress on them.
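A minimal sketch of what carrying action items from one session to the next could look like; the statuses, field names, and example team are hypothetical, chosen only to illustrate the follow-up loop he describes.

```python
from dataclasses import dataclass, field

# Sketch of tracking review action items between sessions. Illustrative only.


@dataclass
class ActionItem:
    description: str
    owner: str
    status: str = "open"  # "open", "in_progress", or "done"


@dataclass
class ReviewSession:
    team: str
    action_items: list = field(default_factory=list)

    def open_items(self):
        """Items to revisit and review progress on in the next session."""
        return [item for item in self.action_items if item.status != "done"]


review = ReviewSession("example-product-sre")
review.action_items.append(
    ActionItem("Write data integrity plan for the orders database", "team-lead"))
print([item.description for item in review.open_items()])
```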

00:09:04

So all of that may sound nice as an idea, but can we actually show that it works? To give you a little bit of context, the program started seven years ago. There are over a hundred SRE teams signed up now. We do not centrally force teams into the process. The individual product area SRE leads, the directors, actually sign their teams up because they see value in it, because they see value for their teams. And over these years, over a thousand reviews were conducted and over 40 different reviewers have participated. But that's output, that's not outcomes. I will talk about the outcomes later, but to better understand them, let me dive into more detail on how the program actually works. What is being discussed in a review?

00:10:09

Now, there are six areas. The first one is team information. That's typically a quick one. It's just to make sure that the team actually has a purpose and understands its purpose, has a clearly defined scope, and a plan to work towards its mission. Every SRE team is expected to have a charter, with the charter signed off and up to date, and a roadmap on how to work towards that charter. If you don't have that, if you cannot clearly articulate what your purpose is, then a lot of other problems will ensue.

00:10:48

Second, on-call health. ProdEx is a lot about operational risk, and pager fatigue is real. It's a huge risk for an SRE team, because operations is a means to an end for an SRE team. The actual work is engineering projects. But if you spend too much time on operations, responding to incidents, you will not have the time and energy to do that. So we look at: how many incidents do you have? How many of these incidents are actually false alerts, or something you cannot do anything about because, I don't know, the network is down and you cannot reach the database? You still get paged and distracted from your engineering work. Also, how noisy is your alerting? If every time a single thing goes wrong, a hundred alerts fire, that is a problem. What is the staffing? Do you have enough engineers that you are not on call every other week?
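A rough sketch of the kind of on-call health signals he lists, pager load per shift and the fraction of pages the team could not act on; the page records and field names are illustrative assumptions, not the actual ProdEx metrics pipeline.

```python
# Sketch of on-call health signals for a review period. Illustrative only.


def oncall_health_signals(pages, shifts):
    """Summarize pager load: pages per shift and non-actionable fraction."""
    total = len(pages)
    non_actionable = sum(1 for p in pages if not p.get("actionable", True))
    return {
        "pages_per_shift": total / shifts if shifts else 0.0,
        "non_actionable_fraction": non_actionable / total if total else 0.0,
    }


pages = [
    {"alert": "HighErrorRate", "actionable": True},
    {"alert": "UpstreamNetworkDown", "actionable": False},  # nothing the team can do
    {"alert": "DiskNearlyFull", "actionable": True},
]
print(oncall_health_signals(pages, shifts=4))
```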

00:11:59

It's important to make time for actual project work, but we also look directly at the project work, and we look at that through the OKR completion rate. How many OKRs does the team have? Does it actually do proper goal planning? And how many of these have a very low score, say below 0.5? If you start a lot of things but don't finish them, it's not helpful. You need to focus better. Also, how does this compete with other operational toil, work that might not be an actual outage, not an incident, but a ticket queue that you have to work through? And what about tickets that turn out to actually be projects in disguise? Do you have a policy on how to put them into your backlog and not disrupt your actual project planning? If all of these things are working, then we are likely to have an SRE team that can actually deliver valuable engineering work.
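A small sketch of the OKR completion signal: how many objectives a team planned and what fraction ended the quarter with a very low score. The 0.5 cut-off comes from the talk; everything else here is an illustrative assumption.

```python
# Sketch of summarizing quarterly OKR scores (each between 0.0 and 1.0).


def okr_summary(scores):
    """Return count, fraction of low-scoring OKRs, and the average score."""
    if not scores:
        return {"count": 0, "low_score_fraction": 0.0, "average": 0.0}
    low = sum(1 for s in scores if s < 0.5)
    return {
        "count": len(scores),
        "low_score_fraction": low / len(scores),
        "average": sum(scores) / len(scores),
    }


# A team that starts many things but finishes few of them stands out here.
print(okr_summary([0.9, 0.3, 0.2, 0.7, 0.1]))
```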

00:13:17

Another area that we look at is: are your service level objectives well managed? And it's not only about everything being green. Do you have SLOs defined for all of your critical aspects? Are they signed off by your stakeholders? Do you have rationales for them? And a rationale might just be: well, that is the historical performance of the system, we do not know any better. That's not great, but it's at least honest. If instead you put in a magic number, like it always has to be 77, then future generations of engineers will work very hard to make it so, not understanding that you just had no better idea. So please write down the rationale. And do you actually measure them, and are they working? And if they are not, do you write postmortems, and do the postmortem action items actually get resolved? Because if you have outages and nothing changes, you will have more outages and you will not deliver sustained value.
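A minimal sketch of the SLO hygiene checks he describes: every critical SLO should carry a written rationale and a stakeholder sign-off, not just a target. The fields and names below are assumptions for illustration, not Google's actual SLO schema.

```python
from dataclasses import dataclass

# Sketch of flagging SLOs that lack a rationale or a sign-off. Illustrative only.


@dataclass
class Slo:
    name: str
    target: float          # e.g. 0.999 availability
    rationale: str = ""    # why this target, even if just "historical performance"
    signed_off_by: str = ""


def slo_findings(slos):
    """Return review findings for SLOs missing a rationale or a sign-off."""
    findings = []
    for slo in slos:
        if not slo.rationale:
            findings.append(f"{slo.name}: target {slo.target} has no written rationale")
        if not slo.signed_off_by:
            findings.append(f"{slo.name}: no stakeholder sign-off")
    return findings


print(slo_findings([Slo("checkout-availability", 0.999, rationale="historical performance")]))
```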

00:14:35

Now, another area that's very important is data integrity. It's something that's easily overlooked, because as long as you don't lose any data, nobody notices that you don't have a proper restore plan <laugh>. Well, when you do notice, it's kind of late. So it's important to talk about these things upfront. First of all, it's important to identify which business-critical data sets you actually own, what data you cannot and do not want to lose. And for these data sets, do you have data integrity plans that explain why you need to get them back, how you need to get them back, what the constraints are, how quickly you need to get them back, and how much data it is? And again, get these signed off by a stakeholder. And it might just be: this data set is very large, but it's generated, so it doesn't make sense to back it up, it would only cost money.

00:15:39

We can just as easily, and sometimes even more quickly, regenerate it. Write that down, get it signed off, so everybody understands that. But if you do need to back it up, also restore-test it. Because if you back things up and never restore them, by the time you need to restore them it might turn out you can't. And you can do this manually, but that doesn't really scale. You actually want to have automation and do this on a frequent basis, so that you detect early on any kind of change, any kind of regression, that breaks your backups. And last but not least, capacity planning. You want to make sure you're not wasting machine resources, but also that you're not wasting engineering time on over-optimizing your capacity management. So it needs to be sized appropriately. You need to look at the utilization.
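A minimal sketch of the restore-test idea: a backup only counts if a restore has actually been exercised recently, so a silent regression gets caught early. The 30-day freshness window and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Sketch of a freshness check for automated restore tests. Illustrative only.
MAX_RESTORE_AGE = timedelta(days=30)  # assumed freshness requirement


def restore_test_ok(last_successful_restore, now=None):
    """True only if a restore of this data set succeeded recently enough."""
    now = now or datetime.now(timezone.utc)
    return now - last_successful_restore <= MAX_RESTORE_AGE


last = datetime(2022, 5, 1, tzinfo=timezone.utc)
print(restore_test_ok(last, now=datetime(2022, 7, 1, tzinfo=timezone.utc)))  # False: stale
```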

00:16:40

You need to make sure that you have alerting, because running out of capacity almost always means that your service is down. And you can also look at how often the team manually adjusts resource allocations, because that is a sign of poor planning and room for improvement. So that is what's being discussed in the ProdEx review. And how does this impact our work? First of all, in the first year that we did these reviews, only 23% of all reviewed teams were scoring high. Over the years, that has increased to 66%. And at the same time, the fraction of teams that were scoring low, that were at risk, that had actual problems, dropped from 44% to a staggering 9%. Now you could argue that maybe the review has just gotten soft, but we can actually also see this in the underlying metrics.

00:17:55

For example, the pager load, the incident rate, dropped by 34%. And looking at the results over these many years, from statistical analysis we see that data integrity is actually the most predictive section for the overall score. If you are doing poorly in data integrity, it's unlikely that you are a well-performing team, and if you are doing well there, you probably are. It's a correlation, but it shows you how important it is to have a good grasp on data integrity. And last but not least, because we ran so many reviews and we invested heavily in automation, we were able to save thousands of hours of leadership time, both from the managers and tech leads of the teams being reviewed and from the reviewers, via the automated review preparation. And I would argue that is a critical ingredient in the program actually being successful: that it was able to be adopted by so many teams, that it was able to scale, and that the review fatigue is not big enough for the program to break down. Because if every reviewed team complained about having to do a week of preparation, their leaders would probably remove them from the program. But instead, we still see more and more teams signing up to the program.

00:19:40

So what do our stakeholders, our users, say? Ben Treynor Sloss, who founded SRE and is still our VP, says it's one of the most important bits of telemetry for him and his leadership team about the health of SRE teams. So being able to aggregate that information up informs leadership about the strategy, about widespread risks. Jessica, who is an SRE director for networking at Google and one of our reviewers, says it's one of the fastest mechanisms to build insight into the challenges and the best practices of teams from all corners of the company and to share that knowledge back into the organization. So as a reviewer, which is a lot of work, it's still valuable for them, because they get a chance to see teams from very different parts of the organization and learn from them for their own organization. And Philip, one of the managers of an SRE team, says ProdEx helped us to keep track of our operational risk and provided valuable mentoring for our long-term strategy, because it really helped his team to identify a fundamental problem they had with their strategy, something they had overlooked, and that gave them a lot of homework to think about and really restructure the team around.

00:21:19

So that's it from me. One thing that I would really like to learn from the community is: how would you assess the operational risk and health of SRE teams, and what are the metrics that you would look at? I know that the metrics we have are not perfect. I know of some gaps that are kind of obvious, but there are other areas that I might not even have thought of. And I would love to hear what you think really makes up the health of an SRE team. Thank you so much.

00:22:04

Thank you, Christof. <laugh> By the way, a testimonial from Treynor Sloss himself. That's awesome. <laugh>