Service Level Objectivity: Improving Mutual Understanding Through the Language of SRE (MediaMath and Google) (Las Vegas 2020)

Prioritizing reliability is hard. Is our service stable enough? Are we shipping features fast enough? Not only do different disciplines (operators, developers, product managers, executives) have different perspectives, they may speak in very different terms: how does “increasing disk I/O” translate to “greater market share?” Site Reliability Engineering (SRE) provides a framework in which shared goals can be described in a common language. In this session, Adam Shake, Director of Site Reliability Engineering at leading AdTech firm MediaMath, reflects on an ongoing initiative to reimplement and improve SRE throughout the company. It’s a journey that requires engaging diverse stakeholders in frank discussions about tradeoffs between reliability and feature development. Attendees will hear a first-hand account of methods that work to align teams, and how small improvements in communication can have a big impact on technology outcomes.

Adam Shake

Director of Site Reliability Engineering, MediaMath Source

David Stanke

Developer Advocate, Google

TRANSCRIPT

00:00:14

Hello. Today we're going to talk about the language of site reliability engineering, or SRE, and discover best practices for helping people achieve fluency in that language. My name is Dave Stanke. I'm a developer advocate at Google, where I focus on DevOps, SRE, and software delivery. Over the past year, I've had the pleasure of collaborating with Adam Shake and his team as they've worked to establish an effective SRE practice in their organization. We'd like to tell you a story, and Adam is our protagonist.

00:00:42

Hello, I am Adam Shake. I'm the director of site reliability engineering at MediaMath. Let's start at the beginning.

00:00:52

So, a little about me. For many years I was a developer. I spent most of my career as a web application developer, eventually transitioning into an operations role where I was doing support and operational work. That kind of began my path toward understanding DevOps and SRE. I found myself drawn toward automation, not just on the operational side, but also when I was a developer. I spent a lot of time writing code to do my sysadmin work, or even some of the development work, for me, which freed my time up to do things I found more interesting. I started learning about DevOps and SRE and realized that I had been doing a lot of those things without even knowing what they were, kind of that whole "I was doing it before it was cool" thing. In fact, at my last job I tried to implement SRE after I learned about it at a conference, and had some limited success in doing that. But what that did was teach me a lot about what SRE isn't. I learned a whole lot about what not to do. So let's take a second now to turn to what SRE actually is.

00:02:07

So SRE is what happens when you treat operations like a software problem, when you apply the software concepts that we've been using for years, like reusability and composability, to systems management. It's a set of principles and practices that have been developed over the past decade or more at Google. We developed this framework because we needed it: as we grew, the limits of human labor had become a bottleneck. We couldn't sustain growth if, every time we wanted to add a new server, we had to hire a sysadmin to operate it. So we committed to sublinear growth of the operations team. However fast the server fleet grows or the user base grows, the SRE team has to grow less, while at the same time continuously improving the reliability of our services. We constantly challenge ourselves to achieve more and better outcomes per unit of human effort.

00:02:59

The way we approach that challenge is SRE. Now Google and many other companies have adopted SRE and are part of a global SRE community. SRE is founded on three key principles. The first principle is that the most important feature of any system is its reliability. Well, this begs the question: is reliability a feature? We don't usually talk about it that way. When we talk about features, we tend to mean functionality, new gizmos to play with. But reliability is an aspect of a system that can make it more or less valuable to the customer. Customers care about it, and it's something that takes work, so it has to be prioritized alongside the other features. Is it the most important? Well, your customers might not always ask for reliability up front; they're more likely to ask for functionality. But if your reliability starts slipping, then fixing it will immediately become their top priority, which brings us to SRE principle

00:03:58

Number two: our monitoring doesn't decide our reliability, our users do. It's easy to get signals. It's not always easy to get signals that matter. You may have had an experience like this. I've oftentimes been woken up in the middle of the night by an alert, and an operator says, "Hey, the CPU usage is spiking. It's really high on all of the servers." And I groggily go to my web browser and say, "Well, I don't know, the website looks fine." So the person on the other end of the phone says, "What do you want me to do about it?" I say, "I don't know, I'm going back to bed. Let's talk about it in the morning." This was a metric, a signal, that really didn't matter to our users. Or perhaps you've had the inverse situation.

00:04:43

I've had this, where my boss comes to me and says, "Hey, our number one client is yelling at us because the website's down." And I look at my monitoring screen and see all green. I say, "What do you mean? Everything's working." And then my boss says, "Well, go look at their page." And I go, and I say, "Oh. Oh, it is super broken." Maybe it's even broken for everybody, but in a way that I didn't pick up in my monitors. Guess what? The customer's right. The site is broken. So user-oriented signals are a key principle of SRE. And the third principle is this: in order to meet our reliability goals, we need, of course, well-engineered software, but we've learned that great software isn't enough. Software alone can't make systems reliable. We need operations that are aligned around meeting our customer-oriented reliability goals.

00:05:36

And we need business practices that empower both of these teams. Perhaps most importantly, we need all of these people to share a mission, and for that we need to find common ground. We're going to argue. We need to argue effectively. SRE gives us a principled way to argue about the desired reliability of a service. It helps us move away from making decisions by shouting louder than anyone else, or avoiding those decisions entirely. It gives us a communication framework to help make tough choices, like: What are we going to promise, and to whom? How do we define what good means for our service, and how much of that good do our customers need? SRE is a language. It's a way of communicating about reliability. And here's the key vocabulary of SRE. First off is service level indicator, or SLI. This is a customer-oriented metric that tells us how well our system is performing.

00:06:34

An example might be how many of our pages are loading with less than two seconds of latency. That could be anywhere from 10% to 60% to a hundred percent. The SLI has no judgment on it. It's really just a measurement: what is the indicator from a customer perspective? Then the SLO, service level objective. That's a target. That's the number that we want to hit. What percentage of page loads do we need to have under two seconds to keep our customers happy? And that target should not be a hundred percent, because we know we can't get to 100%. And even if we tried, we'd be leaving things on the table. There's an opportunity cost to not using that time to deliver functionality and other features that the customer wants. So we set our SLO somewhere less than a hundred percent, at a level that is good enough to keep our customers happy, but not so high that we're wasting effort on unneeded reliability. And then, error budget.

00:07:34

So if we're targeting less than a hundred percent reliability, then by definition we're targeting more than 0% unreliability. That amount of unreliability that we expect, and are in fact trying for, that's our error budget, because we know that we need some room to fail. We need to experiment. We need to release features, which is always a risky thing. And the error budget tells us how much room we have to make those little errors before our customers are going to get really upset. When we have a lot of error budget, we feel free to release new features or try new things. When our error budget is low, that's when we want to pull back and focus on reliability. Finally, I want to introduce one last bit of SRE vocabulary. This one's a bit of a dirty word: it's toil. This is the stuff that keeps us from achieving that sublinear scaling. Specifically:

00:08:28

It's the manual, repetitive, automatable work that's void of long-term value. One great example is SSHing into a machine and restarting a service. That's boring, it's rote, it's procedural, and it's a bad use of a human's time. That human, with their creative, flexible, squishy human brain, could be doing something with much more strategic value. On top of that, we're not even very good at doing this sort of thing. Humans are liable to make mistakes while doing this kind of work. That's not rewarding. That's a lose-lose. So SREs work to identify and eliminate as much toil as possible. Now let's turn back to Adam's story as he starts a new job at MediaMath.
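To make the SLI, SLO, and error budget vocabulary introduced above concrete, here is a minimal Python sketch. It is not from the talk: the function name `evaluate_slo`, the `SloReport` fields, and the 95% target are illustrative assumptions. Only the two-second page-load threshold comes from the example the speakers give.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # fraction of good events actually observed
    slo: float               # target fraction of good events
    allowed_bad: float       # bad events the SLO permits in this window
    budget_remaining: float  # fraction of the error budget still unspent


def evaluate_slo(latencies_s, threshold_s=2.0, slo=0.95):
    """Compute a latency SLI and the error budget left against an SLO.

    The 0.95 default is a hypothetical target, not one quoted in the talk.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO should be strictly between 0 and 1, never 100%")
    total = len(latencies_s)
    if total == 0:
        raise ValueError("need at least one page load to measure")
    # SLI: a plain measurement with no judgment attached.
    good = sum(1 for t in latencies_s if t < threshold_s)
    bad = total - good
    sli = good / total
    # Error budget: the unreliability the SLO explicitly allows.
    allowed_bad = (1.0 - slo) * total
    budget_remaining = 1.0 - bad / allowed_bad
    return SloReport(sli, slo, allowed_bad, budget_remaining)
```

For instance, `evaluate_slo([0.5] * 97 + [3.0] * 3)` measures an SLI of 97% against the assumed 95% target, with roughly 40% of the error budget still unspent, since three of the five permitted slow page loads have been used. A negative `budget_remaining` would be the signal to pull back and focus on reliability.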

00:09:12

Excellent. So briefly, what is MediaMath, and why did I want to lead an SRE team there? MediaMath is an ad tech company built on software. At least from my perspective, we're transitioning from a smaller, almost startup feel into that enterprise world. We're going through a bit of business puberty, and that requires certain things. The systems we support are highly complex and highly distributed, not unlike Google, although not quite at that scale. And currently MediaMath depends heavily on humans to maintain those systems. It's a bit of a failure mode, and constant growth, as is, means that more and more humans have to be thrown into the mix.

00:10:08

So really the maturity of the organization, combined with the complex, highly distributed service that we provide, demands something like SRE. Once I learned about those things, I quickly discovered that it was something I wanted to be a part of. From a mandate perspective, when I was hired, and through conversations with my boss, the hiring manager at that point, we started talking about certain things. One of MediaMath's principles is that teams work better together. So my primary mandate upon being hired was to take a team that had been scattered through the organization, siloed to some degree, and bring them together to form an SRE team, a group of people that work well together, as well as then instituting a proper, more traditional SRE practice in the organization.

00:11:04

So we could address all of the things that were occurring. You look through the organization, and it is even still happening: senior leadership across MediaMath were realizing that there were some reliability problems that needed to be solved. Customers were noticing some things. So a big part of my mandate was to say, let's use those core SRE principles to focus on those reliability issues. But I learned quickly upon starting that the SREs weren't doing that reliability work; they were focused on other things. So there were a lot of competing priorities in this organizational transformation. We had to compete with multiple things in order to complete the transformation of this SRE team. We knew customers were complaining. We were very aware of the situations where they were not happy with reliability; they were experiencing outages.

00:12:01

And in some cases they were even threatening to leave if we didn't address these things. That need to increase our reliability was juxtaposed against a major infrastructure effort that started shortly after I joined the company: a massive data center migration, server refreshes, all of the things. This effort was significant, monumental even, but it was incredibly necessary, and not just for physical reasons. These changes were going to be a part of how we increase the reliability and efficiency of the organization so that we can address those needs of our customers. But we never took the time, or at least didn't have the time at that point, to stop and think about the automation side of things and the reliability side of things, and maybe inform how we approached this massive effort using some of the principles that we've already talked about today, and that I was being hired to help champion across the organization.

00:13:11

So in the end, this massive effort, juxtaposed to some degree against the reliability work, just required a ton of hands-on labor by the SRE team, which is an anti-pattern for what SRE should be. So, chapter one of the story, if we start to get into the details of what happened. Once I got up to speed and met the people, I started to understand what the real truth on the ground in the organization was. Over several years (I think it's been about eight years that there's been an SRE title of some sort at the company), SRE had evolved, or in some cases devolved, into the wrong things. SREs were doing request-driven, sysadmin-level tasks. We had worked through an abandoned tribe model that had left these SREs separated in their own silos and islands, distant from each other, not working together.

00:14:11

And the operational load just continued to increase on the SRE team, because they were the ones that either knew how to do it or had done something similar in the past. It just keeps snowballing. There was no community in the team, no shared knowledge, no economy of scale, because the team was so separated and so involved in their own silos that they weren't working together. And clearly, with all of this operational burden, with all of this sysadmin work, there was an incredible amount of toil. The team, still to this day, is trying to process through an incredible amount of toil. When you've got folks with SRE titles who want to do SRE work, but you're not giving them SRE work to do, you end up with unhappy SREs. And when you have unhappy SREs, they're delivering very limited value to the company.

00:15:01

The work they're doing clearly is work that needs to be done, or needs to be automated, or has to be accomplished in some way, but it was not the work that these people signed up to do. So I set about pulling the existing people back from these silos, from the margins, into a team. That was my primary, number one mandate. We also had some attrition. We had folks leave for other opportunities. We had folks burn out, literally, and just need to go. So we also started reshaping the team via hiring. I think hiring strategies are absolutely critical to effecting change, especially when you're transforming from something that's not quite where you want it to be to where you want it to be. So we set about very consciously and very purposefully hiring the right kind of people, who would help inform where we were going.

00:15:59

So an SRE team has to be made up of people who are well equipped to do SRE work. They have to understand the language of SRE, which is exactly what we're talking about today, and be people that can go out and communicate that language: go out into your organization, build relationships, help get people on the same page, and start resetting expectations, not only inside the team but across the organization. At that point we were starting to gel, we were starting to pull the team together, and now it was time to start learning. This is where we really started to get into the meat of understanding the language of SRE. At DevOpsDays Chicago in 2019, I was lucky enough to get to speak, to give an Ignite presentation. At that same event I saw that there was this Art of SLOs workshop being presented by a couple of incredible Googlers.

00:16:51

Nathen Harvey and Jennifer Petoff gave this workshop, and I attended knowing this was something that we needed to do, and I was absolutely blown away by it. It's an incredible way to understand and get a feel for what this language is and what the terms mean. Immediately after the presentation was over, I ran up to the stage and started talking, saying, "Hey, I think it would be incredible if you guys were to come and do this workshop for my team." And in October of that year, Nathen graciously agreed to come and give it. Really, this was the groundwork that I felt needed to be laid. We all needed to be saying the same thing when we said the word SLO, and saying the same thing when we said the word SLI.

00:17:40

So we started to conceptualize, even prior to the workshop, the idea that understanding what this language means is so important. Before we can write a useful SLO, we all have to agree what SLO means, and we were seeing that that was not true across the organization. So October comes, and Nathen comes in and delivers this workshop to the team. We actually spent an entire day with Nathen talking about everything SRE, including this, all the way up to building some SLOs live in front of him on whiteboards. And it was incredible. The light bulbs going off on my team alone, including my boss, were just incredible. The entire team was very energized and pumped about this. You'll notice on the slide here that there's a link to some resources for the same workshop that we went through.

00:18:35

I would highly encourage you guys to take a moment and look through that stuff. It's incredible. So we get through that workshop. We're pumped up, we're hitting the end of the year, and we start planning for 2020. We created these (Dave coined this term as we were working on this presentation) meta-operational goals: goals to define how we were going to approach operating our own world. I think goals are incredibly important, especially when you're setting foundational goals like this, because you can't hit a target you're not aiming at. And if you don't agree on what it is you're aiming at in the first place, then you clearly can't hit anything. So that comes right back to this whole discussion of language. We have to agree on what we're talking about. So we get through the first part of the year, around January, and we were given an opportunity to take what we had learned through the workshops, through goal setting, and through discussions we had had as a team, and present that back out to a different team in our organization. We presented it as a language thing. We were trying to ensure that we were all staying on the same page. And again, light bulbs in that room started to go off as we were making this presentation.

00:19:49

So I joined the summit that day, and it was really inspiring. People who had been working together for years were finding new ways to communicate, and they were discovering common ground that had always been there but that they had never been able to surface. One conversation that really resonated for me was this moment when Adam was leading the group to develop an SLO for a reconciliation batch process. The question was: how long of a delay is acceptable? How stale could the results be before customers would be upset? As these folks were discussing it, their suggestions ranged from 15 minutes, up to 48 hours, down to a matter of seconds. It became really clear that each of these people had been working under their own assumptions, formed by their own unique knowledge and their own unique relationship to the customer, but those assumptions were wildly different. Therefore they had contradictory ideas about how to approach reliability. They didn't have the right language tools to talk about reliability, so they weren't talking about reliability. And as they started to use the language of SRE to reveal those assumptions, it became clear that what they needed to do was replace those hunches with data, and that data needed to reflect the customer experience.
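One way to picture "replacing hunches with data" for a batch process like the one described above is to score each proposed staleness threshold against observed freshness. This is only a sketch under stated assumptions: the talk does not describe how MediaMath measured staleness, and the function names `freshness_sli` and `compare_candidates` and the sample numbers are invented for illustration.

```python
def freshness_sli(staleness_minutes, threshold_minutes):
    """Fraction of customer reads whose results were no staler than the threshold."""
    if not staleness_minutes:
        raise ValueError("need at least one observation")
    fresh = sum(1 for s in staleness_minutes if s <= threshold_minutes)
    return fresh / len(staleness_minutes)


def compare_candidates(staleness_minutes, candidates):
    """Map each candidate threshold name to the SLI it would have produced."""
    return {
        name: freshness_sli(staleness_minutes, minutes)
        for name, minutes in candidates.items()
    }


# Hypothetical observations: age of the batch results, in minutes,
# at the moment a customer read them.
observed = [2, 5, 9, 14, 20, 45, 90, 240, 600, 3000]

# The three kinds of guesses voiced in the room, expressed in minutes.
candidates = {"30 seconds": 0.5, "15 minutes": 15, "48 hours": 48 * 60}

results = compare_candidates(observed, candidates)
```

With these made-up observations, the three hunches correspond to SLIs of 0%, 40%, and 90%. The remaining question, which only customer feedback can answer, is which threshold actually matches the point where customers become unhappy.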

00:21:04

Absolutely. That particular event was incredible. I was really appreciative of not only Dave, but Nathen was there, and a few other folks from Google, and they really helped support the message that we were trying to get across. It was an incredible experience. Around that same time, one of my SREs, in fact one of the ones that we had targeted as part of our hiring strategy, had a great accomplishment. I like to say, and he says the same thing, that he built the glass castle. Kurt is his name. Kurt got a rapid brain dump from one of the resources that he was replacing. In fact, we literally had to change Kurt's start date to just before the first of the year in order for those two brains to be in the same virtual room and pass knowledge.

00:21:52

But Kurt rapidly moved forward with the principles and the language and the tenets of SRE, and began work delivering a fully functional push-button deployment for a team so that they could self-service deploy on their own. This was the first time that I'm aware of that an SRE at the organization had gone through that process of building this out. Prior to that, it was a highly manual, SRE-driven deployment, and Kurt, through his process, moved that from taking all day with all this manual effort to a 15-minute, single-push-button deployment by development. I may be oversimplifying that a little, but it really was an incredible example of the work SRE could be doing to enable this team to move faster. That team, because of this, is now having discussions about increasing their deployment frequency.

00:22:47

"There is no longer a reason for us not to deploy multiple times a day" is something I've heard them say. I'm seeing that that automation changed a mindset. We can clearly go faster. We can address issues faster. We can address reliability issues faster. We can deliver features faster. We can do all of these things faster than we were before, because we don't have humans getting in the way. One of the hidden principles of SRE, I think, is that SRE should be increasing the velocity of dev teams. When you start talking about things like error budgets and SLOs, people oftentimes assume a negative connotation, like it's a punishment. It's really not. SRE should be doing things that focus on the velocity of those teams, because that affects not only feature development, but our ability to respond to issues.

00:23:35

And this glass castle that Kurt had built had started to demonstrate that value to the entire organization. People started to notice. So that brings us to where we are today and where we're going. There are, of course, continued challenges and competing priorities, and things don't always go the way you want them to go. But we started to figure out what we needed to do. We've made it past most of this huge infrastructure project we had in place, and we're starting to focus on what's next. I managed to get an opportunity to speak up the chain, to senior leadership. I think for our particular situation, this sitting down with a senior leadership person was absolutely crucial, and it can be in your situation too. I've been engaged now in a series of conversations with multiple leaders, specifically senior leaders, but also leaders underneath them, including one particular senior vice president that I'm working with.

00:24:37

He and I seemed, even during our first conversation, to be conceptually aligned on where we're going and on what reliability engineering was supposed to do, but there were also some clear misunderstandings, including definitions, getting back to the whole language thing, that we needed to resolve. One example of that was this idea that SREs should be writing feature code, that they should be retiring story points 90% of the time, a hundred percent of the time, versus the kind of code that SREs will oftentimes be involved in, whether that's automation, or scripting away some task, or observability, or whatever those things might be. That's not to say that SREs can't write feature code, because they probably will at some point. Another great example, and I think this is an absolutely perfect example of the language side of things:

00:25:29

This particular SVP said to me, "I don't want to talk about automation. I want to talk about reliability." I had to stop at that point and realize that he hadn't made the connection that automation is the path to reliability, and that is one of the founding tenets of what SRE is. So this definition of SRE was clouded by this idea that, oh, we want to focus on reliability, but we didn't realize that that meant automation. So light bulb moments for leadership started occurring. We also talked in depth about SRE not scaling with the size of the organization, that it's sublinear. So lots of conversation around that was occurring. Then we get to SRE 2.0. This is how we branded what we're doing.

00:26:19

We're working to ratify a comprehensive SRE model that we've been working on. We're using a job description as the method by which we're accomplishing that. And then the hard work starts. We need to get head count approved. We need to extract the SREs from this heavy operational burden. We need to do that in a way that doesn't cause fear among the rest of the teams that we support. We've created a responsibility inventory, so we know what that work looks like. And we're going to work to quell those fears and have conversations, in a custom way, with each of those teams, so that they know we're not just trying to throw bodies at the toil, or leave it undone, but that we're going to work on solving these problems. Communicating is really the hard work that has to happen now.

00:27:01

Without the right language and culture in place, we were doing way too much manual effort in order to compensate for things. We need to stop and talk about reliability before we do a bunch of things, and ensure consistent definitions across the organization. Every time we go to talk to someone and we say the word SLO, we need to be meaning the same thing. I think that's super critical to the success of what we're going to be doing moving forward. So, the moral of the story: SRE is a framework for connecting the needs of customers and organizations with the data that tells us what we need to do. It's all about data. You can't answer "how bad?" or "what's wrong?" if you don't have actual data to lean on. Our reliability is bad: how bad? Our reliability has to be better.

00:27:43

What is actually better? You have to define those things. We also need to clarify the language itself. We're working to define what SRE is. We have to come to a consensus on what that is. We all need to be speaking the same language. We need to know that automation and reliability are not separate. We need to know that SREs are not sysadmins, and that SREs should be writing code a lot. And then evidence. This is the whole glass castle thing. As we've been working through this process to define things and get the right language in place, we have to build examples, whether they're big examples or small wins, and then we have to celebrate them. That's key: celebrating, acknowledging when people do the right thing. That is how you're going to gain the momentum to slowly build your way up to getting the things done that you want to do. And a win for SRE is a win for the entire company.

00:28:34

So SRE is a language, and like any language, it needs to be taught patiently and methodically. Start by introducing the terms. But words are just noises stuck together, and they're easily misinterpreted. It's easy to confuse SLO and SLI. It's easy to mix them up with things like SLAs, or even get them confused with the raw metrics that drive SLIs but aren't SLIs in and of themselves. As I've mentioned, error budget can sound at first like a punitive term, but it's actually a form of permission: it gives us the opportunity to ship new functionality. So just saying a word won't make it stick, and it won't ensure that everyone hears it the same way. We need to help people make personally resonant attachments to the language. We want people to have that feeling of "we established an SRE practice and my job is less stressful now." We can promise that all day, but in order to really believe it, they need to feel it for themselves.

00:29:31

This is why it's important to start small. Get one team to fully participate in a transformation and reap the benefits. Then use that success to inspire the next team, and then repeat. We take those tangible successes, promote them, and celebrate them. Encouraged by reality, take the time to learn from each experience and continuously improve, because it's a long journey to SRE nirvana. Google was working on these ideas for the better part of 10 years before we started sharing them externally, and we're still continuously revising. Teams regularly review and change their SLOs as new data becomes available, and the SRE community at large is constantly revisiting and revising how we do SRE. And the final step is... no, there is no final step. The journey to site reliability engineering is never done, but each step can bring benefits for your customers and for your company. Each step makes life a little less toilsome for your team and makes your systems a little more reliable for your customers. Thank you, Adam, for sharing your story, and thank you to everyone for joining us.

00:30:36

Thank you, Dave. Now we'd love to hear from you. So look for us on Twitter and let's start a conversation.