Site Reliability Engineers (SRE) are Google's specialists for designing, building, and running complex services that are reliable, scalable, efficient, and maintainable. The SRE Engagement Model describes how the collaboration between developers and SREs works, how SRE is funded, what kind of work SRE is best suited for, and how reliability engineering can be applied early in the service lifecycle.
Jennifer Petoff, PhD
Director, SRE Education, Google
Christof Leng, PhD
SRE Engagements Engineering Lead, Google
One of the prominent themes at so many DevOps Enterprise Summits has been site reliability engineering, which I've always been dazzled by. There are so many interesting things about these principles and practices that Google pioneered back in 2003. I think it's one of the most incredible examples of how one can create a self-balancing system that helps product teams get features to market quickly, but in a way that doesn't jeopardize the reliability and correctness of the services they create. For over a decade, I've wanted to better understand why Google chose a functional orientation for its site reliability engineers. To this day, thousands of Google SREs are still in one organization reporting to Ben Treynor Sloss, VP of 24/7 Engineering, which includes SRE, very purposefully outside of the product organizations. Undoubtedly, this creates certain benefits as well as certain complications. I can't think of two better people to explain the implications of this to technology leaders. Dr. Jennifer Petoff is Global Director of SRE Education and is one of the most widely cited authors in the space. She will be co-presenting with Dr. Christof Leng, SRE Engagements Engineering Lead, who has managed and worked on various parts of Google services, including Cloud, Ads, and internal developer tooling. Here are Dr. Petoff and Dr. Leng.
Welcome to our talk about the collaboration between Google site reliability engineers and developers. I'm Christof Leng, and I lead three teams and a horizontal function called SRE Engagements, based in Google's Munich office. These teams work on improvements for all of Google SRE. One of my responsibilities is to maintain the SRE engagement model, the collection of policies that define how collaboration between SRE and its dev partners works, which is what we're going to talk about today. I've been working at Google since 2014 on various products, and before Google, I was a distributed systems researcher. Over to Jennifer.
Great. Thanks so much, Christof. Hi everyone. I'm Jennifer Petoff, but my friends actually call me Dr. J. I'm Google's Director of SRE Education, and I'm based in Dublin, Ireland. I've actually been at Google for 14 years. I like to think that time flies when you're having fun. You may be wondering, why Dr. J? Well, I have a PhD in chemistry and started my career in the lab, working on things that could start on fire or explode if you expose them to air. I currently lead the global SRE education team, or SRE EDU as we affectionately call it. And I'm also one of the coauthors of the original SRE book that we published back in 2016. Of course, when we aren't living through a pandemic, I love to travel, and I'm a part-time travel blogger at Sidewalk Safari. All right. So now that we've been properly introduced, let's start with some context: context and complexity.
So if you think about it, at more than 2 billion lines of code, it's not an exaggeration to say that Google's production environment might be one of the most complex integrated systems ever created. It's also highly interconnected, which is a key enabler, but also creates many challenges at this scale. The systems that you run may be smaller than that, but they probably depend a lot on third-party code, external dependencies, and maybe even a cloud platform or two. Of course, while this is a case study about Google and shouldn't be applied verbatim to your organization, the challenges of scaling with complexity are universal. So given the scale and complexity, that raises the question: how do you run a planet-scale system? How do you keep it stable? How do you add new functionality to it?
So let's look at balance. Reliability and velocity are oftentimes at odds with each other. In the traditional software development model, developers push for velocity to quickly launch new features, and ops pushes for reliability and thus to slow things down. This is inherent to the way the incentive structure is set up, but the reality is that you need both aspects. You need reliability, and you also need to move fast. So how do you find the right balance and actually resolve this potential conflict? One way to do it is to sidestep the problem, and what we've done at Google is actually create a discipline that balances these competing concerns of reliability and feature velocity. Ben Treynor Sloss, the founder of SRE at Google, actually describes SRE as what happens when you ask a software engineer to design and run an operations function. So SREs come from many different backgrounds, but what they have in common is a mix of software engineering and systems engineering backgrounds. We know how to build and how to run systems, and we deeply care about both aspects.
So SREs focus on reliability to meet the availability targets our users need while maximizing long-term feature velocity. SREs also focus on maintainability, ensuring that we aren't feeding the machines with human toil, where we define toil as work that's manual, repetitive, automatable, tactical, interrupt-driven and reactive, and gives no enduring value. Toil also grows linearly with service growth, which can be problematic. Of course, efficiency is also important: using engineering time and machine resources as efficiently as possible. And let me turn it over to Christof to tell you a little bit more about the scope of what we work on.
SRE at Google, unlike many similar functions at other companies, is one central organization that takes care of many different areas across Google engineering, from user-facing products that everyone knows, like Search and YouTube, to internal infrastructure, like network or developer tools, that our users never interact with directly. It has been tested, applied, and adapted to many different contexts across Google over almost 20 years. The overall organization is more than 3,000 SREs nowadays, which is big. However, Google SRE is grouped into what we call product areas, each a group of related services and products, and each has overall, say, 50 to 300 SREs. Each of these product areas partners with a developer organization that develops these products, but a developer organization is typically much, much larger. And this asymmetry is intentional. It keeps SRE focused on its core mission. It also means that you cannot offload all operational work from dev to SRE, because that would easily overwhelm the much smaller SRE teams. The important part here is that SRE receives its headcount from its dev partner org, typically in the context of individual engagements. The engagements happen at a team level, but the engagement planning and funding is done at the PA level, the product area level.
And the engagement is a peer relationship between an SRE team and a dev team. It's typically scoped around a specific service or product and the relevant production assets or end-user interactions, so a related group of things from an engineering perspective. Such an engagement is not a one-way street. It requires significant contributions from both sides, from both SRE and dev. A common misconception here is that SRE only comes in after the service is implemented and launched, but actually an SRE engagement can happen at any time in the service life cycle, or even cover it from start to finish. Because each service is different (we talked about network, developer tools, customer-facing products) and every life cycle stage has different needs, the types of engagements are diverse. We'll cover that in more detail later. However, what they all have in common is that they are scoped around SRE's mission of reliability, velocity, maintainability, and efficiency, and a shared set of principles. Over to you, Jennifer.
Thanks so much, Christof. So of course these principles are called the SRE engagement model, and it really describes how the SRE and dev collaboration actually works. Let's walk through that in a little bit more detail. The first thing to note is that SRE support is not automatic. SRE is a scarce resource by design. In fact, many services at Google are built and operated by their dev team with no SRE support at all. A few things to call out: SRE teams are funded by dev, so it's their choice whether to invest in SRE or not. Once the headcount is transferred, SRE is responsible for it. Production excellence is a multi-year investment, so engagements are not considered in isolation, but at the SRE product area level. Building an SRE team takes a minimum size, typically two sites with at least six SREs each, and time to build up that deep understanding of the services that the team is responsible for. The service itself and its reliability are ultimately owned by the dev team.
Even if the day-to-day production authority rests with SRE, responsibility for having a reliable service is not offloaded onto SRE or thrown over the fence, so to speak. SRE's job is to help the dev team meet their reliability and velocity goals and to meet the needs of our users first and foremost. So starting and continuing an SRE engagement is a joint decision for both the dev and SRE organizations and the teams involved. Both sides need to agree to start an engagement, and either side can end that partnership. Dev can't force the service onto SRE, and SRE has to give the service back when the devs actually want that. It's important to understand that, at Google, both sides are doing this because they want it. They recognize the value of that partnership.
So what should SRE work on? If SRE support is not universal, how do we decide what they should work on? First and foremost, it always needs to align with the mission. SRE is a specialist role. It does specific things. What SREs work on should have a clear value proposition. The idea is that SRE should only take on work that SRE can do significantly more efficiently than anyone else. What's the value add? If the work can be done by the dev team just as easily, dev should keep that headcount and staff additional developers, giving them more flexibility and less overhead. The work that SREs do must also be impactful, interesting, and challenging for the SRE teams. There have been circumstances where devs think SRE support is, you know, someone holding the pager for you, but SRE is not an ops team.
Our mission is not to handle operations, but to improve the inherent reliability of systems through engineering. Doing ops work is a means to an end. We want to understand what breaks, how to fix it, and how to fix it once and for all. And of course, it's important to point out that ops is not a zero-sum game anyway. Instead of moving operational responsibilities from one place to another, an SRE engagement should focus on reducing the overall ops workload, so it's a win for everyone. Overall, we aim at finding engineering opportunities that lead to sustained long-term value in terms of service health. These are often not obvious to a typical engineer, so you basically need a specialist.
So how do we make work more impactful? Now that you've found what to work on, how do you make sure you're successful? It's important to think of managing a service as a shared endeavor. SRE and dev bring different expertise, but they work towards a common goal, the success of that service. To avoid conflicts, it's important to agree on what success means beforehand. What is success in our users' eyes? So SRE and dev maintain a shared roadmap with goals that can be objectively measured and tracked. This includes regular reviews of both service health and priorities, using service level objectives, or SLOs, and error budgets, which are fundamental SRE principles. This is a standard technique for ensuring objectivity, balancing reliability and velocity, and getting everybody rowing in the same direction.
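To make the error budget idea concrete, here is a minimal sketch in Python. It is not from the talk: the function name `error_budget`, its parameters, and the request-based way of counting failures are assumptions chosen for illustration. It shows how an SLO target translates into a number of allowed failures, and how much of that budget is left in a measurement window.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a request-based error budget for one measurement window.

    slo_target is the fraction of requests that must succeed (for example
    0.999). All names here are hypothetical and only serve the illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the share of requests that is allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests

    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # A negative remainder means the window missed its SLO.
        "overspent": remaining < 0,
    }


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(budget)
```

With a 99.9% SLO and one million requests, the budget is about 1,000 failed requests. A shared roadmap review could then look at how much of the budget remains: plenty left suggests room for faster launches, an overspent budget suggests prioritizing reliability work. That policy is one common reading of error budgets, not something this sketch enforces.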
So SLOs and error budgets promote a common understanding of reliability goals and a common language, and they are basically a tool to measure success. Once you've achieved your defined goals, it's time to think about adjusting investments. Is this still the most impactful area for SRE to work on? Maybe the team can engage with new services. Maybe new topics have come up with this particular service. Maybe the scope has broadened and you need more headcount. It's also possible that sometimes it doesn't work out, for whatever reason. We'll talk a little bit more about that later. Depending on the situation, you may want to double down on the investment or change SRE's focus to a different topic. But whatever happens, engagements and their funding should be regularly reviewed, and headcount and resources should always be allocated to the most impactful work.
All right, so SRE is really about focusing on the important stuff. For SRE, there's always more to do than there is time, so it's really critical to focus on what matters most. An SRE engagement should be scoped to a set of services with clear correlation and boundaries. You can't boil the ocean. Of course, SREs don't work on production health for the sake of engineering merit. They're an advocate for the user. They're a champion for the user and for the user's experience. So look at reliability end to end with customer-centric SLOs. However, there are also infrastructure improvements that may not be directly visible to the user. Things like converging towards standard production platforms are important because that helps you to move faster and really increase feature velocity. It reduces the cost of implementing horizontals, operating services, and moving them between teams. Cognitive load from needing to know many different tools and many different architectures is a major bottleneck for SRE teams to scale.
So standardization really helps with all of that. And finally, highly customized infrastructure also makes it harder for devs to understand production, but to be able to build a reliable system, the devs need decent production knowledge, of course. So SREs should always teach teams to fish rather than providing fish. Otherwise there's a risk that SRE will become a human abstraction layer for production, and behind that wall, that's an invitation for complexity to flourish. You can't build a wall and then complain about a throw-it-over-the-wall mentality. I'm going to turn it over to Christof to talk about engagement types.
Thank you, Jennifer. Okay. Now that we've learned about the core principles, how do we apply them in practice? As I said earlier, SRE can engage in any phase of the service life cycle. We've all seen how important it is to integrate testing, security, and other topics earlier in the life cycle, what they call shift left. The same applies to reliability. During design and implementation, you make many decisions that are incredibly hard or practically impossible to change later: architecture, technology, failover capabilities, and so on. When a production expert has a voice at the table, you can fix problems before they actually happen. So it's super important to have SREs in the design discussions early on. For example, SLOs are often only discussed once the implementation is done. But does the architecture you picked scale to the number of nines of reliability your customers expect? If not, you either have to redesign your whole system or disappoint your users. Not a great decision to make. Or your architecture is actually much more sophisticated than what would be needed to satisfy your users. Then you have probably wasted precious time and resources that could have been invested in time to market or additional features. So having this conversation early on, with the user lens on and with production knowledge, can really help you to be more effective in development. But every life cycle stage and service is different. We can't have a one-size-fits-all approach to engagements.
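The question of whether an architecture can reach a given number of nines can be checked with simple arithmetic at design time. The following Python sketch is an illustration, not material from the talk: the function names are hypothetical, and `serial_availability` assumes that every dependency is required for a request to succeed and that dependencies fail independently, which real systems only approximate.

```python
from math import prod

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability: float, period_days: int = 30) -> float:
    """Downtime that an availability target permits over a period, in minutes."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return (1.0 - availability) * period_days * MINUTES_PER_DAY


def serial_availability(component_availabilities: list[float]) -> float:
    """Availability of a request path that needs all components to be up.

    Assumption: failures are independent, so the availabilities multiply.
    """
    return prod(component_availabilities)


if __name__ == "__main__":
    # Three nines over a 30-day window.
    print(allowed_downtime_minutes(0.999))
    # A frontend and a backend, each at three nines, chained in series.
    print(serial_availability([0.999, 0.999]))
```

Three nines leave roughly 43 minutes of downtime in a 30-day window, and two three-nines components in series already fall below three nines end to end. This is the kind of check that is cheap during design and expensive after launch.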
We categorize our engagement types into three broad buckets: baseline, assisted, and full. They require different levels of what we call commitment, not only in terms of headcount funding for the SRE team, but also in terms of project time invested by the developer team, compliance with best practices, and coordination overhead between those two teams, developers and SREs. Higher is not always better. It comes at a higher cost. Especially for the earlier life cycle phases, with a high rate of change, a lower-tier engagement can be more effective. An engagement can transition from baseline to assisted to full support eventually, but not all services do. It depends on the business priority and the need for SRE involvement. Services can also transition back to an engagement type with lower commitment, sometimes because they don't meet the bar for SRE support anymore, and sometimes because they're mature enough that the SRE work can be scaled back and developers are happy to take on these additional responsibilities because they're not very demanding anymore.
There is no expectation that an SRE team has a specific mix of engagement types from these three types, or is even using all of them. It really depends on the situation. If you're running the core infrastructure for your organization, you may focus on a handful of full support engagements. If you're working with a new business area with many experimental services, you could do many baseline and assisted engagements instead. Let me walk you through the different engagement types and give concrete examples. Baseline is the entry-level engagement. It's tactical and reactive. It's open to everyone in scope for a given SRE team. It's included in the price of funding that SRE team. It consists of ad hoc support, for example office hours or consulting projects. It provides easy access to production expertise, but the execution lies with the developers. It can also apply to incident response. For example, SRE could provide an escalation on-call for the dev on-calls to ask for help during a major outage. The SRE on-calls may not have detailed knowledge about the particular service, but they can often help with generic production knowledge, which the developers might not have to the same extent.
Additionally, the SRE on-call can handle escalations to backend dependencies or communication with stakeholders and allow the developer on-call to focus on debugging and mitigation.
One possible example of how to implement baseline is what we call SRE love. It's a program for time-boxed consulting projects, typically two to four hours per week for a quarter. The developers submit a proposal, and someone from the SRE team helps them to execute it. So typically there's a call every quarter, you submit your proposal, and the SRE team picks whatever they can do. The focus here is on knowledge transfer, not on SRE developing the project for the developers. It helps to improve productionization early in the life cycle, or for services that don't have any other type of SRE support. So the SRE mentors the devs to do the work themselves and makes it a lot easier for them. It's also often the first time that this dev team interacts with SRE. It helps to build a personal relationship between both sides and put faces to the function. The devs gain a better understanding of what SRE can do for them, and SREs learn about a new service that may become relevant to them at a later point in time, either because it moves to a different engagement type or because it's a dependency during an outage. So it's good to know about it.
Next up is the assisted engagement. This type provides a longer-term strategic engagement. There's a dedicated point of contact on the SRE side, and typically also on the dev side, and a shared roadmap is defined. The focus is on engineering projects that improve production health. Obviously it can include code or redesign work, productionization, infrastructure migrations, and many other things. It does not usually include operations. The service is still operated by the dev team. Sometimes individual SREs may join the dev rotation temporarily to gain a deeper understanding of the service, which then helps them with that project work. When this type of engagement is applied at the right time, it can provide huge value that pays off for years to come, even though it does not usually include any kind of operations.
One example of an assisted engagement is embedded SRE. Typically one or two SREs join a dev team for a particular project. They're still part of the SRE home team and participate in its on-call rotation, but they spend all of their project time on their dev team. As this is a significant investment, it is reserved for truly critical projects and for situations where these SREs can be force multipliers for the dev team. For example, it can be used to apply SRE principles during the design and implementation of a major new service, or it is used to prepare the onboarding of the service into the full support tier. Full support is the most expensive type of engagement. It requires substantial headcount investment and continuous contributions from the dev team. Services that get there will need to meet a high bar of production excellence, and SRE becomes the effective owner of production.
SRE runs production for the dev team. The development team is still also responsible for the service being reliable. That responsibility is not offloaded, but SRE will do the bulk of the work. The most obvious attribute is that SRE is on call, but the goal is not to keep a broken service afloat with on-call heroics. Any straightforward production issue can be fixed much more efficiently in an assisted engagement. So if the service is broken, do not try to onboard it to full support. Fix the obvious stuff first. Once the low-hanging fruit has been dealt with, the SRE on-call work provides the additional comprehension necessary to solve the less obvious, complex production problems. You really need skin in the game. You have to see the service go up in flames in production to be able to analyze it and to give good recommendations on how to re-architect it. Both sides should work together towards simplification to reduce the cost of operating the service, up to a point when neither side cares who carries the pager. And then it might be a good point in time to go back to the assisted engagement and save some SRE time.
A good example of this is SRE's mantra of automating yourself out of your job every 18 months. Sounds ambitious, right? That can be done either through incremental improvements to service infrastructure or through pivotal changes to a new approach. Either way, it requires an intimate understanding of the service, which is typically built through day-to-day involvement in production and engineering. The goal is to reduce the need for continuous SRE involvement to make more time for more exciting projects, in some cases handing the service back to devs by moving to an assisted engagement, or sometimes simply to keep up with the growing support load of a rapidly scaling system. If the production work grows along with the service, the ops work will go through the roof, so you have to cut down on it just to be able to keep up. It's essential to reduce the cost of complex, high-touch systems before they grow out of control and become what we call haunted graveyards.
As the time investment, both for operations and for project work, is high, full support should be reserved for mature and business-critical systems, the most important things that your organization runs. But even for those, there's typically plenty of opportunity for improvement, no matter how mature the system already is. Okay. But what do you do when it doesn't work out? For example, operational load gets out of control, the service is unstable, SREs and devs don't see eye to eye anymore, or dev has completely disengaged. Don't panic. Apply the best practices for incident management at the strategic level. Don't fight the symptoms, but understand the root cause and prevent recurrence. What you would typically do during a production outage also applies at the engagement level. Try to come up with a strategic plan to fix the identified issue, and get buy-in from your dev partners and potentially critical dependencies that you rely on. And if there isn't enough engineering time to execute that, ask your leadership to declare that the work required to fix the problem trumps all other project work. We call this a code yellow. If you can't agree with your dev partners, escalate up on both sides of the management chain, on the SRE side and on the dev side.
And if that doesn't help, perhaps it's time to disengage, to hand back the pager. And if your leadership doesn't want to do that either, well, mobility for SREs is typically high inside of Google, so then it's maybe time to start looking for a different SRE team. And if there are no SREs left on the team anymore, well, the pager is handed back anyway. This is not what typically happens. Everybody understands that SREs need to be kept happy as well. You can't throw them under the bus, and the developers understand the value that they get out of it. So normally you don't end up there. You agree on a solution. Whatever you do, remember that heroics are not sustainable. You can't firefight production forever. Neither can you work day and night. It's not sustainable. Solve the problem through smart engineering, not brute force. It also helps to remember the engagement principles. It is a shared endeavor. We must set a reasonable scope and adjust investments when needed. Invest in your dev-SRE relationships. Spend more time together. United you stand, divided you fall. Okay. That's
All right. Thanks, Christof. So here's Google SRE in a nutshell. Google SRE is a specialist organization that takes a principled approach to balancing reliability and feature velocity while keeping maintainability and efficiency in mind. SREs partner with dev teams to solve hard engineering problems that would be a lot more difficult for the devs on their own. SREs and devs don't work against each other with conflicting incentives, but together towards shared goals, which are codified through things like service level objectives and error budgets. SRE helps dev to build their production muscle, and ideally the collaboration should begin early in the software development life cycle. Each engagement is different, and we pick the approach that fits the needs of the service best. But of course they all share the same principles.
And because the service, the business, and outside factors are constantly changing, the engagement needs to be adapted regularly to stay impactful. So it's not one and done. You have to keep revisiting it and keeping it fresh. As we wrap up here, I just wanted to call out that if you want to learn more about Google SRE, we've actually published three books now with plenty of content that you can check out. So feel free to read them for free online: the original SRE book, The Site Reliability Workbook, and Building Secure and Reliable Systems. And to wrap things up as well, we'd like to engage with you, of course, pun intended in this particular case. So I'm interested in knowing, how does SRE work at your organization?
We've talked about how SRE works at Google, but we recognize that different places do SRE differently, and there's no single right way. So we would, of course, love to learn from you and what you've found to work. Personally, I'd also love to know what other SRE topics you'd like to hear about. We're always looking for inspiration for conference talks, blog posts, and publications. And of course, if you're wondering whether we've already published on a topic you're interested in, feel free to check out sre.google and cloud.google.com/sre for the latest, and let us know what those gaps might be and where we can potentially publish more. And finally, the great thing about conferences is the chance to connect with people. That's certainly harder when everything is virtual, but we'd still love to try.
Christof, I'd love to see you in person at some point. It's been a while. And Christof and I, of course, would welcome the chance to connect on Twitter or LinkedIn. We've included our coordinates on this particular slide. For me on LinkedIn, if you send a note with your invitation saying that you saw our DevOps Enterprise Summit talk, that would be helpful as well, because we do get quite a few requests to connect, and it's nice to have that context. So thank you, everyone. Thanks for tuning in today. We're looking forward to chatting with you all on Slack, getting your feedback, and hearing your questions. Christof, it's been great, and thanks so much. Thank you.